Abstract



Randomized algorithms are often enjoyed for their simplicity, but the hash functions used 
to yield the desired theoretical guarantees are often neither simple nor practical. Here we show 
that the simplest possible tabulation hashing provides unexpectedly strong guarantees. 

The scheme itself dates back to Carter and Wegman (STOC'77). Keys are viewed as consisting
of $c$ characters. We initialize $c$ tables $T_1, \ldots, T_c$ mapping characters to random hash codes.
A key $x = (x_1, \ldots, x_c)$ is hashed to $T_1[x_1] \oplus \cdots \oplus T_c[x_c]$, where $\oplus$ denotes xor.

While this scheme is not even 4-independent, we show that it provides many of the guarantees
that are normally obtained via higher independence, e.g., Chernoff-type concentration, min-wise
hashing for estimating set intersection, and cuckoo hashing.

1 Introduction



An important target of the analysis of algorithms is to determine whether there exist practical 
schemes, which enjoy mathematical guarantees on performance. 

Hashing and hash tables are one of the most common inner loops in real-world computation, and 
are even built-in "unit cost" operations in high level programming languages that offer associative 
arrays. Often, these inner loops dominate the overall computation time. Knuth gave birth to
the analysis of algorithms in 1963 [Knu63] when he analyzed linear probing, the most popular
practical implementation of hash tables. Assuming a perfectly random hash function, he bounded
the expected number of probes. However, we do not have perfectly random hash functions. The
approach of algorithms analysis is to understand when simple and practical hash functions work
well. The most popular multiplication-based hashing schemes maintain the $O(1)$ running times
when the sequence of operations has sufficient randomness [MV08]. However, they fail badly even
for very simple input structures like an interval of consecutive keys [PPR09, PT10, TZ09], giving
linear probing an undeserved reputation of being non-robust.

On the other hand, the approach of algorithm design (which may still have a strong element
of analysis) is to construct (more complicated) hash functions providing the desired mathematical
properties. This is usually done in the influential $k$-independence paradigm of Wegman and
Carter [WC81]. It is known that 5-independence is sufficient [PPR09] and necessary [PT10] for
linear probing. Then one can use the best available implementation of 5-independent hash functions,
the tabulation-based method of [TZ04, TZ09].



Here we analyze simple tabulation hashing. This scheme views a key $x$ as a vector of $c$ characters
$x_1, \ldots, x_c$. For each character position, we initialize a totally random table $T_i$, and then use the
hash function

$$h(x) = T_1[x_1] \oplus \cdots \oplus T_c[x_c].$$



This is a well-known scheme dating back at least to Wegman and Carter [WC81]. From a practical
view-point, tables $T_i$ can be small enough to fit in fast cache, and the function is probably the
easiest to implement beyond the bare multiplication. However, the scheme is only 3-independent,
and was therefore assumed to have weak mathematical properties. We note that if the keys are
drawn from a universe of size $u$, and hash values are machine words, the space required is $O(c\,u^{1/c})$
words. The idea is to make this fit in fast cache. We also note that the hash values are bit strings,
so when we hash into bins, the number of bins is generally understood to be a power of two.
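
The scheme just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function name `make_simple_tabulation`, its parameters, and the use of Python's `random.Random` as a stand-in for truly random table entries are all assumptions made for the example.

```python
import random

def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=None):
    """Return h with h(x) = T_1[x_1] xor ... xor T_c[x_c].

    Keys are integers of c * char_bits bits; each character indexes its own table.
    (Hypothetical helper; random.Random stands in for truly random entries.)
    """
    rng = random.Random(seed)
    size = 1 << char_bits
    tables = [[rng.getrandbits(out_bits) for _ in range(size)] for _ in range(c)]
    mask = size - 1

    def h(x):
        code = 0
        for table in tables:
            code ^= table[x & mask]  # look up the current character
            x >>= char_bits          # move on to the next character
        return code

    return h

h = make_simple_tabulation(seed=1)
# Hash into 2^10 bins by keeping the 10 most significant bits of the 32-bit code.
bin_index = h(0xDEADBEEF) >> (32 - 10)
```

With `c = 4` and 8-bit characters the four tables hold 1024 words in total, matching the $O(c\,u^{1/c})$ space bound for $u = 2^{32}$. The four keys `0x0000`, `0x0001`, `0x0100`, `0x0101` have hash codes that xor to zero for every choice of tables, which is one way to see that the scheme is not 4-independent.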

The challenge in analyzing simple tabulation is the significant dependence between keys.
Nevertheless, we show that the scheme works in some of the most important randomized algorithms,
including linear probing and several instances when $\Omega(\lg n)$-independence was previously needed.
We confirm our findings by experiments: simple tabulation is competitive with just one 64-bit
multiplication, and the hidden constants in the analysis appear to be very acceptable in practice.

In many cases, our analysis gives the first provably good implementation of an algorithm which 
matches the algorithm's conceptual simplicity if one ignores hashing. 

Desirable properties. We will focus on the following popular properties of truly random hash 
functions. 

• The worst-case query time of chaining is $O(\lg n / \lg\lg n)$ with high probability (w.h.p.). More
generally, when distributing balls into bins, the bin load obeys Chernoff bounds.

• Linear probing runs in expected $O(1)$ time per operation. Variance and all constant moments
are also $O(1)$.

• Cuckoo hashing: Given two tables of size $m \ge (1+\varepsilon)n$, it is possible to place a ball in one of
two randomly chosen locations without any collision, with probability $1 - O(1/n)$.

• Given two sets $A, B$, we have $\Pr_h[\min h(A) = \min h(B)] = \frac{|A \cap B|}{|A \cup B|}$. This can be used to quickly
estimate the intersection of two sets, and follows from a property called minwise independence:
for any $x \notin S$, $\Pr_h[h(x) < \min h(S)] = \frac{1}{|S|+1}$.



As defined by Wegman and Carter [WC81] in 1977, a family $H = \{h : [u] \to [m]\}$ of hash
functions is $k$-independent if for any distinct $x_1, \ldots, x_k \in [u]$, the hash codes $h(x_1), \ldots, h(x_k)$ are
independent random variables, and the hash code of any fixed $x$ is uniformly distributed in $[m]$.



Chernoff bounds continue to work with high enough independence [SSS95]; for instance,
independence $\Theta(\lg n / \lg\lg n)$ suffices for the bound on the maximum bin load. For linear probing,
5-independence is sufficient [PPR09] and necessary [PT10]. For cuckoo hashing, $O(\lg n)$-independence
suffices and at least 6-independence is needed [CK09]. While minwise independence cannot
be achieved, one can achieve $\varepsilon$-minwise independence with the guarantee
$(\forall) x \notin S$, $\Pr_h[h(x) < \min h(S)] = \frac{1 \pm \varepsilon}{|S|+1}$. For this, $O(\lg \frac{1}{\varepsilon})$ independence is sufficient
[Ind01] and necessary [PT10]. (Note that the $\varepsilon$ is a bias so it is a lower bound on how well set
intersection can be approximated, with any number of independent experiments.)

The canonical construction of $k$-independent hash functions is a random degree $k-1$ polynomial
in a prime field, which has small representation but $\Theta(k)$ evaluation time. Competitive
implementations of polynomial hashing simulate arithmetic modulo Mersenne primes via bitwise operations.
Even so, tabulation-based hashing with $O(n^{1/c})$ space and $O(ck)$ evaluation time is significantly
faster [TZ04]. The linear dependence on $k$ is problematic, e.g., when $k \approx \lg n$.

Siegel [Sie04] shows that a family with superconstant independence but $O(1)$ evaluation time
requires $\Omega(u^{\varepsilon})$ space, i.e. it requires tabulation. He also gives a solution that uses $O(u^{1/c})$ space,
$c^{O(c)}$ evaluation time, and achieves $u^{\Omega(1/c^2)}$ independence (which is superlogarithmic, at least
asymptotically). The construction is non-uniform, assuming a certain small expander which gets used in
a graph product. Dietzfelbinger and Rink [DR09] use universe splitting to obtain similar high
independence with some quite different costs. Instead of being highly independent on the whole
universe, their goal is to be highly independent on an unknown but fixed set $S$ of size $n$. For some
constant parameter $\gamma$, they tolerate an error probability of $n^{-\gamma}$. Assuming no error, their hash
function is highly independent on $S$. The evaluation time is constant and the space is sublinear.
For error probability $n^{-\gamma}$, each hash computation calls $O(\gamma)$ subroutines, each of which evaluates
its own degree $O(\gamma)$ polynomial. The price for a lower error tolerance is therefore a slower hash
function (even if we only count it as constant time in theory).

While polynomial hashing may perform better than its independence suggests, we have no
positive example yet. On the tabulation front, we have one example of a good hash function that
is not formally $k$-independent: cuckoo hashing works with an ad hoc hash function that combines
space $O(n^{1/c})$ and polynomials of degree $O(c)$ [DW03].



1.1 Our results 

Here we provide an analysis of simple tabulation showing that it has many of the desirable properties 
above. For most of our applications, we want to rule out certain obstructions with high probability. 
This follows immediately if certain events are independent, and the algorithms design approach is 
to pick a hash function guaranteeing this independence, usually in terms of a highly independent 
hash function.

Instead we here stick with simple tabulation with all its dependencies. This means that we have
to struggle in each individual application to show that the dependencies are not fatal. However, 
from an implementation perspective, this is very attractive, leaving us with one simple and fast 
scheme for (almost) all our needs. 

In all our results, we assume the number of characters is $c = O(1)$. The constants in our
bounds will depend on c. Our results use a rather diverse set of techniques analyzing the table 
dependencies in different types of problems. For chaining and linear probing, we rely on some 
concentration results, which will also be used as a starting point for the analysis of min-wise 
hashing. Theoretically, the most interesting part is the analysis for cuckoo hashing, with a very 
intricate study of the random graph constructed by the two hash functions. 

Chernoff bounds. We first show that simple tabulation preserves Chernoff-type concentration: 

Theorem 1. Consider hashing $n$ balls into $m \ge n^{1-1/(2c)}$ bins by simple tabulation. Let $q$ be an
additional query ball, and define $X_q$ as the number of regular balls that hash into a bin chosen as
a function of $h(q)$. Let $\mu = E[X_q] = \frac{n}{m}$. The following probability bounds hold for any constant $\gamma$:

$$(\forall)\,\delta \le 1 : \quad \Pr[|X_q - \mu| > \delta\mu] < 2e^{-\Omega(\delta^2 \mu)} + m^{-\gamma} \qquad (1)$$
$$(\forall)\,\delta = \Omega(1) : \quad \Pr[X_q > (1+\delta)\mu] < (1+\delta)^{-\Omega((1+\delta)\mu)} + m^{-\gamma} \qquad (2)$$

With $m \le n$ bins, every bin gets

$$n/m \pm O\left(\sqrt{n/m}\,\log^c n\right) \qquad (3)$$

keys with probability $1 - n^{-\gamma}$.



Contrasting standard Chernoff bounds (see, e.g., [MR95]), Theorem 1 can only provide
polynomially small probability, i.e. at least $n^{-\gamma}$ for any desired constant $\gamma$. In addition, the exponential
dependence on $\mu$ in (1) and (2) is reduced by a constant which depends (exponentially) on the
constants $\gamma$ and $c$. It is possible to get some super polynomially small bounds with super constant $\gamma$ but
they are not as clean. An alternative way to understand the bound is that our tail bound depends
exponentially on $\varepsilon\mu$, where $\varepsilon$ decays to subconstant as we move more than inversely polynomial
out in the tail. Thus, our bounds are sufficient for any polynomially high probability guarantee.
However, compared to the standard Chernoff bound, we would have to tolerate a constant factor
more balls in a bin to get the same failure probability.

By the union bound, Theorem 1 implies that with $m = \Theta(n)$ bins, no bin receives more than
$O(\lg n / \lg\lg n)$ balls w.h.p. This is the first realistic hash function to achieve this fundamental
property. Similarly, for linear probing with fill bounded below 1, (2) shows that the longest filled
interval is of length $O(\log n)$ w.h.p.

Linear probing. Building on the above concentration bounds, we show that if the table size is
$m = (1+\varepsilon)n$, then the expected time per operation is $O(1/\varepsilon^2)$, which asymptotically matches the
bound of Knuth [Knu63] for a truly random function. In particular, this compares positively with
the $O(1/\varepsilon^{13/6})$ bound of [PPR09] for 5-independent hashing.

Our proof is a combinatorial reduction that relates the performance of linear probing to
concentration bounds. The results hold for any hash function with concentration similar to Theorem 1.
To illustrate the generality of the approach, we also improve the $O(1/\varepsilon^{13/6})$ bound from [PPR09]
for 5-independent hashing to the optimal $O(1/\varepsilon^2)$. This was raised as an open problem in [PPR09].

For simple tabulation, we get quite strong concentration results for the time per operation, e.g.,
constant variance for constant $\varepsilon$. For contrast, with 5-independent hashing, the variance is only
known to be $O(\log n)$ [PPR09, TZ09].



Cuckoo hashing. In general, the cuckoo hashing algorithm fails iff the random bipartite graph
induced by two hash functions contains a component with more edges than vertices. With truly
random hashing, this happens with probability $O(\frac{1}{n})$. Here we study the random graphs induced by
simple tabulation, and obtain a rather unintuitive result: the optimal failure probability is inversely
proportional to the cube root of the set size.

Theorem 2. Any set of $n$ keys can be placed in two tables of size $m = (1+\varepsilon)n$ by cuckoo hashing and
simple tabulation with probability $1 - O(n^{-1/3})$. There exist sets on which the failure probability is
$\Omega(n^{-1/3})$.

Thus, cuckoo hashing and simple tabulation are an excellent construction for a static dictionary. 
The dictionary can be built (in linear time) after trying $O(1)$ independent hash functions w.h.p.,
and later every query runs in constant worst-case time with two probes. We note that even though 
cuckoo hashing requires two independent hash functions, these essentially come for the cost of one 
in simple tabulation: the pair of hash codes can be stored consecutively, in the same cache line, 
making the running time comparable with evaluating just one hash function. 
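
Whether a given pair of hash functions allows a static cuckoo dictionary to be built can be tested directly on the induced bipartite graph: each key is an edge between its two candidate cells, and placement succeeds exactly when no connected component has more edges (keys) than vertices (cells). The sketch below checks this with a union-find structure. The names `make_tab` and `cuckoo_placeable` are hypothetical, the table size `m` is assumed to be a power of two, and `random.Random` stands in for truly random table entries.

```python
import random

def make_tab(c, char_bits, out_bits, rng):
    """Simple tabulation on c characters of char_bits bits (illustrative helper)."""
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        code = 0
        for t in tables:
            code ^= t[x & mask]
            x >>= char_bits
        return code

    return h

def cuckoo_placeable(keys, h0, h1, m):
    """True iff no component of the cuckoo graph has more edges than vertices.

    Vertices 0..m-1 are cells of the first table, m..2m-1 cells of the second.
    """
    parent = list(range(2 * m))
    edges = [0] * (2 * m)   # edge count per component root
    verts = [1] * (2 * m)   # vertex count per component root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x in keys:
        a, b = find(h0(x)), find(m + h1(x))
        if a != b:               # merge the two components into root b
            parent[a] = b
            edges[b] += edges[a]
            verts[b] += verts[a]
        edges[b] += 1            # the key itself is one more edge
        if edges[b] > verts[b]:
            return False
    return True

rng = random.Random(7)
m = 1 << 11                                  # two tables of 2048 cells each
keys = rng.sample(range(1 << 32), 1000)
h0, h1 = make_tab(4, 8, 11, rng), make_tab(4, 8, 11, rng)
ok = cuckoo_placeable(keys, h0, h1, m)       # on failure, draw new tables and retry
```

Building the dictionary then amounts to redrawing `h0` and `h1` until the check succeeds, which by Theorem 2 takes $O(1)$ attempts w.h.p.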

In the dynamic case, Theorem 2 implies that we expect $\Omega(n^{4/3})$ updates between failures
requiring a complete rehash with new hash functions.

Our proof involves a complex understanding of the intricate, yet not fatal dependencies in simple 
tabulation. The proof is a (complicated) algorithm that assumes that cuckoo hashing has failed, 
and uses this knowledge to compress the random tables $T_1, \ldots, T_c$ below the entropy lower bound.

Using our techniques, it is also possible to show that if $n$ balls are placed in $O(n)$ bins in
an online fashion, choosing the least loaded bin at each time, the maximum load is $O(\lg\lg n)$ in
expectation.

Minwise independence. In the full version, we show that simple tabulation is $\varepsilon$-minwise
independent, for a vanishingly small $\varepsilon$ (inversely polynomial in the set size). This would require
$\Omega(\log n)$ independence by standard techniques.

Theorem 3. Consider a set $S$ of $n = |S|$ keys and $q \notin S$. Then with $h$ implemented by simple
tabulation:

$$\Pr[h(q) < \min h(S)] = \frac{1 \pm \varepsilon}{n}, \quad \text{where } \varepsilon = O\left(\frac{\lg^2 n}{n^{1/c}}\right).$$

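This can be used to estimate the size of set intersection by estimating:

$$\Pr[\min h(A) = \min h(B)] = \sum_{x \in A \cap B} \Pr[h(x) < \min h(A \cup B \setminus \{x\})] = \frac{|A \cap B|}{|A \cup B|} \cdot \left(1 \pm \tilde{O}\left(\frac{1}{|A \cup B|^{1/c}}\right)\right).$$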

For good bounds on the probabilities, we would make multiple experiments with independent hash
functions. An alternative based on a single hash function is to consider, for each set, the
$k$ elements with the smallest hash values. We will also present concentration bounds for this
alternative.
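
The first approach, repeating the experiment with independent hash functions, can be sketched as follows. The code estimates $|A \cap B| / |A \cup B|$ as the fraction of experiments in which the two minima coincide. The names `make_tab` and `jaccard_estimate`, the choice of 64-bit hash codes on 32-bit keys, and the use of `random.Random` in place of truly random tables are assumptions for illustration.

```python
import random

def make_tab(c, char_bits, out_bits, rng):
    """Simple tabulation on c characters of char_bits bits (illustrative helper)."""
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        code = 0
        for t in tables:
            code ^= t[x & mask]
            x >>= char_bits
        return code

    return h

def jaccard_estimate(A, B, reps=200, seed=0):
    """Fraction of independent experiments with min h(A) == min h(B)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        h = make_tab(4, 8, 64, rng)          # fresh tables for each experiment
        if min(map(h, A)) == min(map(h, B)):
            hits += 1
    return hits / reps

A = set(range(0, 300))
B = set(range(100, 400))
estimate = jaccard_estimate(A, B)            # true ratio is 200/400 = 0.5
```

The estimate concentrates around the true ratio as `reps` grows, up to the bias $\varepsilon$ of Theorem 3, which no number of repetitions removes.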

Fourth moment bounds. An alternative to Chernoff bounds in proving good concentration is
to use bounded moments. In the full version of the paper, we analyze the 4th moment of a bin's
size when balls are placed into bins by simple tabulation. For a fixed bin, we show that the 4th
moment comes extremely close to that achieved by truly random hashing: it deviates by a factor
of $1 + O(4^c/m)$, which is tiny except for a very large number of characters $c$. This would require
4-independence by standard arguments. This limited 4th moment for a given bin was discovered
independently by [BCL+10].



If we have a designated query ball $q$, and we are interested in the size of a bin chosen as a
function of $h(q)$, the 4th moment of simple tabulation is within a constant factor of that achieved
by truly random hashing (on close inspection of the proof, that constant is at most 2). This would
require 5-independence by standard techniques. (See [PT10] for a proof that 4-independence can
fail quite badly when we want to bound the size of the bin in which $q$ lands.) Our proof exploits
an intriguing phenomenon that we identify in simple tabulation: in any fixed set of 5 keys, one of
them has a hash code that is independent of the other four's hash codes.

Unlike our Chernoff-type bounds, the constants in the 4th moment bounds can be analyzed quite
easily, and are rather tame. Compelling applications of 4th moment bounds were given by [KR93]
and [Tho09]. In [KR93], it was shown that any hash function with a good 4th moment bound
suffices for a nonrecursive version of quicksort, routing on the hypercube, etc. In [Tho09], linear
probing is shown to have constant expected performance if the hash function is a composition of
universal hashing down to a domain of size $O(n)$, with a strong enough hash function on this small
domain (i.e. any hash function with a good 4th moment bound).

We will also use 4th moment bounds to attain certain bounds of linear probing not covered by
our Chernoff-type bounds. In the case of small fill $\alpha = \frac{n}{m} = o(1)$, we use the 4th moment bounds
to show that the probability of a full hash location is $O(\alpha)$.

Pseudorandom numbers. The tables used in simple tabulation should be small to fit in the
first level of cache. Thus, filling them with truly random numbers would not be difficult (e.g. in
our experiments we use atmospheric noise from random.org). If the amount of randomness needs
to be reduced further, we remark that all proofs continue to hold if the tables are filled by a
$\Theta(\lg n)$-independent hash function (e.g. a polynomial with random coefficients).

With this modification, simple tabulation naturally lends itself to an implementation of a very
efficient pseudorandom number generator. We can think of a pseudorandom generator as a hash
function on range $[n]$, with the promise that each $h(i)$ is evaluated once, in the order of increasing $i$.
To use simple tabulation, we break the universe into two, very lopsided characters: $[n/R] \times [R]$, for $R$
chosen to be $\Theta(\lg n)$. Here the second coordinate is least significant, that is, $(x, y)$ represents $xR + y$.
During initialization, we fill $T_2[1 \,..\, R]$ with $R$ truly random numbers. The values of $T_1[1 \,..\, n/R]$ are
generated on the fly, by a polynomial of degree $\Theta(\lg n)$, whose coefficients were chosen randomly
during initialization. Whenever we start a new row of the matrix, we can spend a relatively large
amount of time to evaluate a polynomial to generate the next value $r_1$ which we store in a register.
For the next $R$ calls, we run sequentially through $T_2$, xoring each value with $r_1$ to provide a new
pseudorandom number. With $T_2$ fitting in fast memory and scanned sequentially, this will be
much faster than a single multiplication, and with $R$ large, the amortized cost of generating $r_1$
is insignificant. The pseudorandom generator has all the interesting properties discussed above,
including Chernoff-type concentration, minwise independence, and random graph properties.
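
The generator described above can be sketched as a Python generator. This is an illustration under stated assumptions, not the authors' code: the name `tabulation_prng`, the default values of `R` and `degree` (which the text chooses as $\Theta(\lg n)$), the prime field $2^{61}-1$ for the polynomial, and the use of `random.Random` for the initial random values are all choices made for the example.

```python
import random

def tabulation_prng(n, R=64, degree=16, seed=None):
    """Yield n pseudorandom numbers h(i) = T_1[i // R] xor T_2[i % R].

    T_2 holds R random words; T_1 is never stored: its entry for the current
    row is produced on the fly by a random polynomial over a prime field.
    """
    rng = random.Random(seed)        # stand-in for a source of true randomness
    p = (1 << 61) - 1                # Mersenne prime used as the field size
    T2 = [rng.getrandbits(61) for _ in range(R)]
    coeffs = [rng.randrange(p) for _ in range(degree + 1)]
    r1 = 0
    for i in range(n):
        row, col = divmod(i, R)
        if col == 0:
            # New row: evaluate the polynomial at `row` by Horner's rule.
            r1 = 0
            for a in coeffs:
                r1 = (r1 * row + a) % p
        yield T2[col] ^ r1           # cheap step: one lookup and one xor

numbers = list(tabulation_prng(1000, seed=42))
```

Only one call in every `R` pays for the polynomial evaluation; the remaining calls cost a sequential table read and an xor.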

Experimental evaluation. We performed an experimental evaluation of simple tabulation. Our
implementation uses tables of 256 entries (i.e. using c = 4 characters for 32-bit data and c = 8 
characters with 64-bit data). The time to evaluate the hash function turns out to be competitive 
with multiplication-based 2-independent functions, and significantly better than for hash functions 
with higher independence. We also evaluated simple tabulation in applications, in an effort to 
verify that the constants hidden in our analysis are not too large. Simple tabulation proved very 
robust and fast, both for linear probing and for cuckoo hashing. 

Notation. We now introduce some notation that will be used throughout the proofs. We want
to construct hash functions $h : [u] \to [m]$. We use simple tabulation with an alphabet $\Sigma$ and
$c = O(1)$ characters. Thus, $u = |\Sigma|^c$ and $h(x_1, \ldots, x_c) = \bigoplus_{i=1}^{c} T_i[x_i]$. It is convenient to think of
each hash code $T_i[x_i]$ as a fraction in $[0, 1)$ with large enough precision. We always assume $m$ is a
power of two, so a hash code in $[m]$ is obtained by keeping only the most significant $\log_2 m$ bits
in such a fraction. We always assume the table stores long enough hash codes, i.e. at least $\log_2 m$
bits.

Let $S \subset \Sigma^c$ be a set of $|S| = n$ keys, and let $q$ be a query. We typically assume $q \notin S$, since
the case $q \in S$ only involves trivial adjustments (for instance, when looking at the load of the bin
$h(q)$, we have to add one when $q \in S$). Let $\pi(S, i)$ be the projection of $S$ on the $i$-th coordinate,
$\pi(S, i) = \{x_i \mid (\forall) x \in S\}$.

We define a position-character to be an element of $[c] \times \Sigma$. Then, the alphabets on each
coordinate can be assumed to be disjoint: the first coordinate has alphabet $\{1\} \times \Sigma$, the second has
alphabet $\{2\} \times \Sigma$, etc. Under this view, we can treat a key $x$ as a set of $c$ position-characters (on
distinct positions). Furthermore, we can assume $h$ is defined on position-characters: $h((i, a)) =
T_i[a]$. This definition is extended to keys (sets of position-characters) in the natural way $h(x) =
\bigoplus_{\alpha \in x} h(\alpha)$.

When we say with high probability in $r$, we mean $1 - r^{-a}$ for any desired constant $a$. Since
$c = O(1)$, high probability in $|\Sigma|$ is also high probability in $u$. If we just say high probability, it is
understood to be in $n$.

2 Concentration Bounds 

This section proves Theorem 1, except branch (3), which is shown in the full version of the paper.

If $n$ elements are hashed into $n^{1+\varepsilon}$ bins by a truly random hash function, the maximum load of
any bin is $O(1)$ with high probability. First we show that simple tabulation preserves this guarantee.
Building on this, we show that the load of any fixed bin obeys Chernoff bounds. Finally we show
that the Chernoff bound holds even for a bin chosen as a function of the query hash code, $h(q)$.

As stated in the introduction, the number of bins is always understood to be a power of two. 
This is because our hash values are xor'ed bit strings. If we want different numbers of bins we could 
view the hash values as fractions in the unit interval and divide the unit interval into subintervals. 
Translating our results to this setting is standard. 

2.1 Hashing into Many Bins 

The notion of peeling lies at the heart of most work in tabulation hashing. If a key from a set of 
keys contains one position-character that doesn't appear in the rest of the set, its hash code will be 
independent of the rest. Then, it can be "peeled" from the set, as its behavior matches that with
truly random hashing. More formally, we say a set T of keys is peelable if we can arrange the keys
of T in some order, such that each key contains a position-character that doesn't appear among 
the previous keys in the order. 

Lemma 4. Suppose we hash $n \le m^{1-\varepsilon}$ keys into $m$ bins, for some constant $\varepsilon > 0$. For any
constant $\gamma$, all bins get less than $d = \min\{((1+\gamma)/\varepsilon)^c,\ 2^{(1+\gamma)/\varepsilon}\}$ keys with probability $\ge 1 - m^{-\gamma}$.

Proof. We will show that among any $d$ elements, one can find a peelable subset of size $t \ge
\max\{d^{1/c}, \lg d\}$. Then, a necessary condition for the maximum load of a bin to be at least $d$ is
that some bin contain $t$ peelable elements. There are at most $\binom{n}{t} < n^t$ such sets. Since the hash
codes of a peelable set are independent, the probability that a fixed set lands into a common bin
is $1/m^{t-1}$. Thus, an upper bound on the probability that the maximum load is $d$ can be obtained:
$n^t / m^{t-1} \le m^{(1-\varepsilon)t} / m^{t-1} = m^{1-\varepsilon t}$. To obtain failure probability $m^{-\gamma}$, set $t = (1+\gamma)/\varepsilon$.

It remains to show that any set $T$ of $|T| = d$ keys contains a large peelable subset. Since
$T \subseteq \pi(T, 1) \times \cdots \times \pi(T, c)$, it follows that there exists $i \in [c]$ with $|\pi(T, i)| \ge d^{1/c}$. Pick some
element from $T$ for every character value in $\pi(T, i)$; this is a peelable set of $t = d^{1/c}$ elements.

To prove $t \ge \log_2 d$, we proceed iteratively. Consider the coordinate giving the largest projection,
$j = \arg\max_i |\pi(T, i)|$. As long as $|T| \ge 2$, $|\pi(T, j)| \ge 2$. Let $\alpha$ be the most popular value in $T$
for the $j$-th character, and let $T^*$ contain only elements with $\alpha$ on the $j$-th coordinate. We have
$|T^*| \ge |T| / |\pi(T, j)|$. In the peelable subset, we keep one element for every value in $\pi(T, j) \setminus \{\alpha\}$,
and then recurse in $T^*$ to obtain more elements. In each recursion step, we obtain $k \ge 1$ elements,
at the cost of decreasing $\log_2 |T|$ by $\log_2(k+1)$. Thus, we obtain at least $\log_2 d$ elements overall. □
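
The first construction in the proof, one key per character value on the coordinate with the largest projection, can be sketched as follows, together with a checker for the definition of a peelable order. Keys are represented as tuples of characters; the function names `is_peelable_order` and `peelable_subset` are hypothetical and chosen only for this illustration.

```python
def is_peelable_order(keys):
    """Check that every key has a position-character unseen among earlier keys."""
    seen = set()
    for key in keys:
        pcs = set(enumerate(key))    # position-characters (position, character)
        if pcs <= seen:              # nothing new: this key cannot be peeled
            return False
        seen |= pcs
    return True

def peelable_subset(T):
    """Pick one key per character value on the coordinate with the largest projection."""
    c = len(T[0])
    j = max(range(c), key=lambda i: len({x[i] for x in T}))
    chosen = {}
    for x in T:
        chosen.setdefault(x[j], x)   # keep the first key seen for each character
    return list(chosen.values())

T = [(a, b) for a in range(3) for b in range(3)]   # d = 9 keys, c = 2
subset = peelable_subset(T)                        # at least d^(1/c) = 3 keys
```

Every selected key carries a distinct character on coordinate `j`, so any ordering of the subset is a peelable order. The full square $\{0,1\}^2$, by contrast, is not peelable in the order checked in the test below, since its last key brings no new position-character.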

We note that, when the subset of keys of interest forms a combinatorial cube, the probabilistic
analysis in the proof is sharp up to constant factors. In other words, the exponential dependence
on $c$ and $\gamma$ is inherent.

2.2 Chernoff Bounds for a Fixed Bin 

We study the number of keys ending up in a prespecified bin $B$. The analysis will define a total
ordering $\prec$ on the space of position-characters, $[c] \times \Sigma$. Then we will analyze the random process
by fixing hash values of position-characters $h(\alpha)$ in the order $\prec$. The hash value of a key $x \in S$
becomes known when the position-character $\max_\prec x$ is fixed. For $\alpha \in [c] \times \Sigma$, we define the group
$G_\alpha = \{x \in S \mid \alpha = \max_\prec x\}$, the set of keys for whom $\alpha$ is the last position-character to be fixed.

The intuition is that the contribution of each group $G_\alpha$ to the bin $B$ is a random variable
independent of the previous $G_\beta$'s, since the elements $G_\alpha$ are shifted by a new hash code $h(\alpha)$.
Thus, if we can bound the contribution of $G_\alpha$ by a constant, we can apply Chernoff bounds.
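
To make the grouping concrete, the following sketch computes the groups $G_\alpha$ for a given ordering of the position-characters. Keys are tuples of characters, a position-character is a pair `(position, character)`, and the ordering is passed as a list from first to last. The function name `groups_by_last_fixed` is hypothetical.

```python
def groups_by_last_fixed(keys, order):
    """Map each position-character a to G_a, the keys whose last-fixed character is a."""
    rank = {pc: r for r, pc in enumerate(order)}   # position in the ordering
    groups = {}
    for x in keys:
        last = max(enumerate(x), key=lambda pc: rank[pc])
        groups.setdefault(last, []).append(x)
    return groups

keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
order = [(0, 0), (0, 1), (1, 0), (1, 1)]   # fix position 0 first, then position 1
groups = groups_by_last_fixed(keys, order)
```

With this ordering every key's hash code becomes known only when its second character is fixed, so the four keys split into two groups of two, one per character on the last position.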

Lemma 5. There is an ordering $\prec$ such that the maximal group size is $\max_\alpha |G_\alpha| \le n^{1-1/c}$.

Proof. We start with $S$ being the set of all keys, and reduce $S$ iteratively, by picking a
position-character $\alpha$ as next in the order, and removing keys $G_\alpha$ from $S$. At each point in time, we pick the
position-character $\alpha$ that would minimize $|G_\alpha|$. Note that, if we pick some $\alpha$ as next in the order,
$G_\alpha$ will be the set of keys $x \in S$ which contain $\alpha$ and contain no other character that hasn't been
fixed: $(\forall) \beta \in x \setminus \{\alpha\}$, $\beta \prec \alpha$.

What we have to prove is that, as long as $S \ne \emptyset$, there exists $\alpha$ with $|G_\alpha| \le |S|^{1-1/c}$. If some position
$i$ has $|\pi(S, i)| > |S|^{1/c}$, there must be some character $\alpha$ on position $i$ which appears in less than
$|S|^{1-1/c}$ keys; thus $|G_\alpha| \le |S|^{1-1/c}$. Otherwise, $|\pi(S, i)| \le |S|^{1/c}$ for all $i$. Then if we pick an arbitrary
character $\alpha$ on some position $i$, we have $|G_\alpha| \le \prod_{j \ne i} |\pi(S, j)| \le (|S|^{1/c})^{c-1} = |S|^{1-1/c}$. □

From now on assume the ordering $\prec$ has been fixed as in the lemma. This ordering partitions
$S$ into at most $n$ non-empty groups, each containing at most $n^{1-1/c}$ keys. We say a group $G_\alpha$ is
$d$-bounded if no bin contains more than $d$ keys from $G_\alpha$.

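Lemma 6. Assume the number of bins is $m \ge n^{1-1/(2c)}$. For any constant $\gamma$, with probability
$\ge 1 - m^{-\gamma}$, all groups are $d$-bounded where

$$d = \min\left\{(2c(3+\gamma))^c,\ 2^{2c(3+\gamma)}\right\}.$$

Proof. Since $|G_\alpha| \le n^{1-1/c} \le m^{1-1/(2c)}$, by Lemma 4 (applied with $\varepsilon = 1/(2c)$ and failure
exponent $2+\gamma$) we get that there are at most $d$ keys from
$G_\alpha$ in any bin with probability $1 - m^{-(2+\gamma)} \ge 1 - m^{-\gamma}/n$. The conclusion follows by union bound
over the $\le n$ groups. □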



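Henceforth, we assume that $\gamma$ and $d$ are fixed as in Lemma 6. Chernoff bounds (see [MR95,
Theorem 4.1]) consider independent random variables $X_1, X_2, \ldots \in [0, d]$. Let $X = \sum_i X_i$, $\mu =
E[X]$, and $\delta > 0$, the bounds are:

$$\Pr[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu/d}, \qquad \Pr[X \le (1-\delta)\mu] \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu/d}. \qquad (4)$$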



Let $X_\alpha$ be the number of elements from $G_\alpha$ landing in the bin $B$. We are quite close to applying
Chernoff bounds to the sequence $X_\alpha$, which would imply the desired concentration around $\mu = \frac{n}{m}$.
Two technical problems remain: the $X_\alpha$'s are not $d$-bounded in the worst case, and they are not
independent.

To address the first problem, we define the sequence of random variables $\hat{X}_\alpha$ as follows: if
$G_\alpha$ is $d$-bounded, let $\hat{X}_\alpha = X_\alpha$; otherwise $\hat{X}_\alpha = |G_\alpha|/m$ is a constant. Observe that $\sum_\alpha \hat{X}_\alpha$
coincides with $\sum_\alpha X_\alpha$ if all groups are $d$-bounded, which happens with probability $1 - m^{-\gamma}$. Thus
a probabilistic bound on $\sum_\alpha \hat{X}_\alpha$ is a bound on $\sum_\alpha X_\alpha$ up to an additive $m^{-\gamma}$ in the probability.

Finally, the $\hat{X}_\alpha$ variables are not independent: earlier position-characters dictate how keys
cluster in a later group. Fortunately (4) holds even if the distribution of each $X_i$ is a function of
$X_1, \ldots, X_{i-1}$, as long as the mean $E[X_i \mid X_1, \ldots, X_{i-1}]$ is a fixed constant $\mu_i$ independent of
$X_1, \ldots, X_{i-1}$. A formal proof will be given in Appendix B. We claim that our means are fixed this
way: regardless of the hash codes for $\beta \prec \alpha$, we will argue that $E[\hat{X}_\alpha] = \mu_\alpha = |G_\alpha|/m$.

Observe that whether or not $G_\alpha$ is $d$-bounded is determined before $h(\alpha)$ is fixed in the order $\prec$.
Indeed, $\alpha$ is the last position-character to be fixed for any key in $G_\alpha$, so the hash codes of all keys
in $G_\alpha$ have been fixed up to an xor with $h(\alpha)$. This final shift by $h(\alpha)$ is common to all the keys,
so it cannot change whether or not two elements land together in a bin. Therefore, the choice of
$h(\alpha)$ does not change if $G_\alpha$ is $d$-bounded.

After fixing all hash codes $\beta \prec \alpha$, we decide if $G_\alpha$ is $d$-bounded. If not, we set $\hat{X}_\alpha = |G_\alpha|/m$.
Otherwise $\hat{X}_\alpha = X_\alpha$ is the number of elements we get in $B$ when fixing $h(\alpha)$, and $h(\alpha)$ is a uniform
random variable sending each element to $B$ with probability $1/m$. Therefore $E[\hat{X}_\alpha] = |G_\alpha|/m$.
This completes the proof that the number of keys in bin $B$ obeys Chernoff bounds from (4), which
immediately imply (1) and (2) in Theorem 1.

2.3 The Load of a Query-Dependent Bin



When we are dealing with a special key $q$ (a query), we may be interested in the load of a bin
$B_q$, chosen as a function of the query's hash code, $h(q)$. We show that the above analysis also
works for the size of $B_q$, up to small constants. The critical change is to insist that the query
position-characters come first in our ordering $\prec$:

Lemma 7. There is an ordering $\prec$ placing the characters of $q$ first, in which the maximal group
size is $2 \cdot n^{1-1/c}$.

Proof. After placing the characters of $q$ at the beginning of the order, we use the same iterative
construction as in Lemma 5. Each time we select the position-character $\alpha$ minimizing $|G_\alpha|$, place $\alpha$
next in the order $\prec$, and remove $G_\alpha$ from $S$. It suffices to prove that, as long as $S \ne \emptyset$, there exists
a position-character $\alpha \notin q$ with $|G_\alpha| \le 2 \cdot |S|^{1-1/c}$. Suppose in some position $i$, $|\pi(S, i)| > |S|^{1/c}$.
Even if we exclude the query character $q_i$, there must be some character $\alpha$ on position $i$ that
appears in at most $|S| / (|\pi(S, i)| - 1)$ keys. Since $S \ne \emptyset$, $|S|^{1/c} \ge 1$, so $|\pi(S, i)| \ge 2$. This means
$|\pi(S, i)| - 1 \ge |S|^{1/c}/2$, so $\alpha$ appears in at most $2|S|^{1-1/c}$ keys. Otherwise, we have $|\pi(S, i)| \le |S|^{1/c}$
for all $i$. Then, for any character $\alpha$ on position $i$, we have $|G_\alpha| \le \prod_{j \ne i} |\pi(S, j)| \le |S|^{1-1/c}$. □

The lemma guarantees that the first nonempty group contains the query alone, and all later
groups have random shifts that are independent of the query hash code. We lost a factor two
on the group size, which has no effect on our asymptotic analysis. In particular, all groups are
$d$-bounded w.h.p. Letting $X_\alpha$ be the contribution of $G_\alpha$ to bin $B_q$, we see that the distribution of
$X_\alpha$ is determined by the hash codes fixed previously (including the hash code of $q$, fixing the choice
of the bin $B_q$). But $E[X_\alpha] = |G_\alpha|/m$ holds irrespective of the previous choices. Thus, Chernoff
bounds continue to apply to the size of $B_q$. This completes the proof of (1) and (2) in Theorem 1.

In Theorem 1 we limited ourselves to polynomially small error bounds $m^{-\gamma}$ for constant $\gamma$.
However, we could also consider a super constant $\gamma = \omega(1)$ using the formula for $d$ in Lemma 6.
For the strongest error bounds, we would balance $m^{-\gamma}$ with the Chernoff bounds from (4). Such
balanced error bounds would be messy, and we found it more appealing to elucidate the standard
Chernoff-style behavior when dealing with polynomially small errors.



2.4 Few bins 



We will now settle Theorem 1 (3), proving some high probability bounds for the concentration with $m \le n$ bins. As stated in (3), we will show, w.h.p., that the number of keys in each bin is

$$n/m \pm O\left(\sqrt{n/m}\,\log^c n\right).$$

Consider any subset $S$ of $s \le n$ keys that only vary in $b$ characters. Generalizing (3), we will show for any $L \ge 32 \log n$, that with probability $1 - \exp(-\Omega(L))$, the keys in $S$ get distributed with

$$\begin{cases} s/m \pm \sqrt{s/m}\,L^b & \text{if } s \ge mL^b/2 \\ \le L^b & \text{if } s \le mL^b/2 \end{cases} \qquad (5)$$

keys in each of the $m$ bins. This is trivial for $m = 1$, so we can assume $m \ge 2$. The proof is by induction on $(b, s)$. First we will prove that each inductive step fails with small probability. Later we will conclude that the combined failure probability for the whole induction is small.

For the base case of the induction, if $s \le L^b$, the result is trivial since it holds even if some bin gets all the keys from $S$. This case includes the situation where we have no characters to vary, that is, when $s = 1$ and $b = 0$. We may therefore assume that $s > L^b$ and $b > 0$. The character positions where $S$ does not vary will only shuffle the bins, but not affect which keys from $S$ go together, so we can ignore them when giving bounds for the sizes of all bins.

Considering the varying characters in $S$, we apply the ordering from Section 2.2 leading to a grouping of $S$. By Lemma 5, there is an ordering $\prec$ such that the maximal group size is $\max_\alpha |G_\alpha| \le s^{1-1/b}$. In particular, $\max_\alpha |G_\alpha| < s/L$.

First, assume that $s \le mL^b/2$. Each group has one less free character, so by induction, each group has at most $L^{b-1}$ keys in each bin, that is, each group is $L^{b-1}$-bounded. Now as in Section 2.2, for any fixed bin, we can apply the Chernoff upper-bound from (4) with $d = L^{b-1}$. We have $\mu = s/m \le L^b/2$, and we want to bound the probability of getting a bin of size at least $x = L^b \ge 2\mu$. For an upper bound, we use $\mu' = x/2 \ge \mu$ and $\delta' = 1$, and get a probability bound of

$$\left(\frac{e^{\delta'}}{(1+\delta')^{(1+\delta')}}\right)^{\mu'/d} = (e/4)^{\mu'/d} \le (e/4)^{L/2}.$$

With the union bound, the probability that any bin has more than $L^b$ keys is bounded by $m(e/4)^{L/2}$.

Partitioning many keys. Next, we consider the more interesting case where $s \ge mL^b/2$. As stated in (5), we want to limit the probability that the contribution of $S$ to any bin deviates by more than $\sqrt{s/m}\,L^b$ from the mean $s/m$. We partition the groups into levels $i$ based on their sizes. On level $0$ we have the groups of size up to $mL^{b-1}/2$. On level $i > 0$, we have the groups of size between $t_i = mL^{b-1}2^{i-2}$ and $2t_i$. For each $i$, we let $S_i$ denote the union of the level $i$ groups. We are going to handle each level $i$ separately, providing a high probability bound on how much the contribution of $S_i$ to a given bin can deviate from the mean. Adding the deviations from all levels, we bound the total deviation in the contribution from $S$ to this bin. Let $s_i$ be the number of keys in $S_i$ and define

$$\Delta_i = \sqrt{s_i/m}\,L^{b-1/2}. \qquad (6)$$

For level $i > 0$, we will use $\Delta_i$ as our deviation bound, while for $i = 0$ we will use $\bar\Delta_0 = \max\{\Delta_0, L^b\}$.

The total deviation. We will now show that the above level deviation bounds provide the desired total deviation bound of $\sqrt{s/m}\,L^b$ from (5). Summing over the levels, the total deviation is bounded by $L^b + \sum_i \sqrt{s_i/m}\,L^{b-1/2}$. To bound the sum, we first consider the smaller terms where $s_i < s/\log n$. Then $\sqrt{s_i/m}\,L^{b-1/2} < \sqrt{s/m}\,L^b/\sqrt{L \log n}$. We have at most $\log n$ values of $i$, so these smaller terms sum to at most $\sqrt{s/m}\,L^b\sqrt{(\log n)/L}$.

Next we consider the larger terms where $s_i \ge s/\log n$. Each such term can be bounded as

$$\sqrt{s_i/m}\,L^{b-1/2} = \frac{s_i/m}{\sqrt{s_i/m}}\,L^{b-1/2} \le \frac{s_i/m}{\sqrt{s/m}}\,L^b\sqrt{(\log n)/L}.$$

The sum of the larger terms is therefore also bounded by $\sqrt{s/m}\,L^b\sqrt{(\log n)/L}$. Thus the total deviation is bounded by $L^b + \sqrt{s/m}\,L^b \cdot 2\sqrt{(\log n)/L}$. Assuming $L \ge 9\log n$, we have $2\sqrt{(\log n)/L} \le 2/3$. Moreover, with $n \ge 4$ and $b \ge 1$, we have $s/m \ge L^b/2 \ge 9$. It follows that $L^b + 2\sqrt{(\log n)/L}\,\sqrt{s/m}\,L^b \le \sqrt{s/m}\,L^b$, as desired.
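
The arithmetic of this paragraph can be checked numerically. The following is a small sketch, not part of the proof: the function names are hypothetical, logarithms are assumed to be base 2, and the level sizes in the usage note are arbitrary values chosen to satisfy $s/m \ge L^b/2$ with at most $\log n$ levels.

```python
import math

def summed_level_deviation(level_sizes, m, L, b):
    """L^b plus the sum over levels of sqrt(s_i/m) * L^(b - 1/2)."""
    per_level = [math.sqrt(s_i / m) * L ** (b - 0.5) for s_i in level_sizes]
    return L ** b + sum(per_level)

def target_deviation(level_sizes, m, L, b):
    """The total deviation allowed by (5): sqrt(s/m) * L^b with s = sum of s_i."""
    s = sum(level_sizes)
    return math.sqrt(s / m) * L ** b
```

For instance, with $n = 2^{16}$, $L = 9 \log_2 n = 144$, $b = 1$, $m = 8$ and level sizes 600, 1200, 2400, 4800, 9600, the summed level deviations stay below the target.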

Deviation from small groups. We now consider the contribution to our bin from the small groups in $S_0$. These groups have size at most $(mL^b/2)^{1-1/b} \le mL^{b-1}/2$, and $b - 1$ free character positions, so inductively each group contributes at most $d_0 \le L^{b-1}$ to each bin. We want to bound the probability that the deviation from the mean $\mu_0 = s_0/m$ exceeds $\bar\Delta_0 = \max\{\Delta_0, L^b\}$.

Suppose $\mu_0 \le \bar\Delta_0$. For a Chernoff upper bound, we use $\mu' = \bar\Delta_0 \ge \mu_0$ and $\delta' = 1$, and get a probability bound of

$$\left(\frac{e^{\delta'}}{(1+\delta')^{(1+\delta')}}\right)^{\mu'/d_0} = (e/4)^{\bar\Delta_0/d_0} \le (e/4)^L.$$

On the other hand, if $\mu_0 \ge \bar\Delta_0$, we have a relative deviation of $\delta_0 = \bar\Delta_0/\mu_0 = \sqrt{m/s_0}\,L^{b-1/2} \le 1$. The probability of this deviation for any fixed bin is bounded by

$$\left(\frac{e^{\delta_0}}{(1+\delta_0)^{(1+\delta_0)}}\right)^{\mu_0/d_0} \le \exp\left(-(\mu_0/d_0)\,\delta_0^2/3\right) = \exp(-L^b/3) \le \exp(-L/3).$$

Larger groups. To deal with a larger group level $i \ge 1$, we will use a standard symmetric Chernoff bound, which is easily derived from the negative version of (4). We consider $n$ independent random variables $X_1, \ldots, X_n \in [-d, d]$, each with mean zero. Let $X = \sum_i X_i$. For any $\delta > 0$,

$$\Pr[|X| \ge \delta d n] \le 2\exp(-n\delta^2/4). \qquad (7)$$

As we did for (4) in Section 2.2, we note that (7) also holds when $X_i$ depends on the previous $X_j$, $j < i$, as long as $|X_i| \le d$ and $\mathrm{E}[X_i] = 0$. Back to our problem, let $s_i$ be the total size of the groups on level $i$. Each group $G$ has size at least $t_i = mL^{b-1}2^{i-2}$, so we have at most $n_i = s_i/t_i$ groups. The group $G$ has only $b - 1$ varying characters and $t_i \ge t_1 = mL^{b-1}/2$, so inductively from (5), the contribution of $G$ to any bin deviates by at most $d_i = \sqrt{|G|/m}\,L^{b-1} \le \sqrt{2t_i/m}\,L^{b-1}$ from the mean $|G|/m$. We let $X_G$ denote the contribution of $G$ to our bin minus the mean $|G|/m$. Thus, regardless of the distribution of previous groups, we have $\mathrm{E}[X_G] = 0$ and $|X_G| \le d_i$. We want to bound the probability that $|\sum_G X_G| \ge \Delta_i$. We therefore apply (7) with

$$\delta_i = \frac{\Delta_i}{d_i n_i} = \frac{\sqrt{s_i/m}\,L^{b-1/2}}{\sqrt{2t_i/m}\,L^{b-1}\,s_i/t_i} = \sqrt{t_i L/(2 s_i)}.$$

The probability that the contribution to our bin deviates by more than $\Delta_i$ is therefore bounded by

$$2\exp(-n_i\delta_i^2/4) = 2\exp\left(-\frac{s_i}{t_i} \cdot \frac{t_i L}{2 s_i} \cdot \frac{1}{4}\right) = 2\exp(-L/8).$$

Conveniently, this dominates the error probabilities of $(e/4)^L$ and $\exp(-L/3)$ from level 0. There are fewer than $\log n$ levels, so by the union bound, the probability of a too large deviation from any level to any bin is bounded by $m(\log n)\,2\exp(-L/8)$.

Error probability for the whole induction. Above we proved that any particular inductive step fails with probability at most $m(\log n)\,2\exp(-L/8)$. We want to conclude that the probability of any failure in the whole induction is bounded by $nm\exp(-L/8)$.

First we note that all the parameters of the inductive steps are determined deterministically. More precisely, the inductive step is defined via the deterministic grouping from Lemma 5. This grouping corresponds to a certain deterministic ordering of the position characters, and we use this ordering to analyze the failure probability of the inductive step. However, there is no relation between the orderings used to analyze different inductive steps. Thus, we are dealing with a recursive deterministic partitioning. Each partitioning results in groups that are at least $L$ times smaller, so the recursion corresponds to a tree with degrees at least $L$. At the bottom we have base cases, each containing at least one key. The internal nodes correspond to inductive steps, so we have fewer than $2n/L$ of these. If $L \ge 4\log n$, we conclude that the combined failure probability is at most $(2n/L)\,m(\log n)\,2\exp(-L/8) \le nm\exp(-L/8)$. With $L \ge 32\log n$, we get that the overall failure probability is bounded by $\exp(-L/64)$. This completes our proof that (5) is satisfied with high probability, hence the proof of Theorem 1 (3).



3 Linear Probing and the Concentration in Arbitrary Intervals 

We consider linear probing using simple tabulation hashing to store a set $S$ of $n$ keys in an array of size $m$ (as in the rest of our analyses, $m$ is a power of two). Let $\alpha = 1 - \varepsilon = n/m$ be the fill. We will argue that the performance with simple tabulation is within constant factors of the performance with a truly random function, both in the regime $\alpha \ge 1/2$ (high fill) and $\alpha \le 1/2$ (low fill). With high fill, the expected number of probes when we insert a new key is $O(1/\varepsilon^2)$ and with low fill, it is $1 + O(\alpha)$.

Pagh et al. [PPR09] presented an analysis of linear probing with 5-independent hashing using 4th moment bounds. They got a bound of $O(1/\varepsilon^{13/6})$ on the expected number of probes. We feel that our analysis, which is centered around dyadic intervals, is simpler, tighter, and more generic. Recall that a dyadic interval is an interval of the form $[j2^i, (j+1)2^i)$ for integers $i$ and $j$. In fact, as we shall see later in Section 6.4, our analysis also leads to an optimal $O(1/\varepsilon^2)$ for 5-independent hashing, settling an open problem from [PPR09]. However, with simple tabulation, we get much stronger concentration than with 5-independent hashing, e.g., constant variance with constant $\varepsilon$ whereas the variance is only known to be $O(\log n)$ with 5-independent hashing.

When studying the complexity of linear probing, the basic measure is the length $R = R(q, S)$ of the longest run of filled positions starting from $h(q)$, that is, positions $h(q), \ldots, h(q) + R - 1$ are filled with keys from $S$ while $h(q) + R$ is empty. This is the case if and only if $R$ is the largest number such that there is an interval $I$ which contains $h(q)$ and $h(q) + R - 1$ and such that $I$ is full in the sense that at least $|I|$ keys from $S$ hash to $I$. In our analysis, we assume that $q$ is not in the set. An insert or unsuccessful search with $q$ will consider exactly $R + 1$ positions. A successful search for $q$ will consider at most $R(q, S \setminus \{q\}) + 1$ positions. For deletions, the cost is $R(q, S) + 1$ but where $q \in S$. For now we assume $q \notin S$, but we shall return to the case $q \in S$ in Section 3.1.
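
To make the quantity $R(q, S)$ concrete, here is a minimal Python sketch of linear probing on top of simple tabulation. It is an illustration under our own assumptions, not the implementation analyzed here: keys are non-negative integers split into `c` characters of `char_bits` bits, probing wraps around the end of the array, the table is never full, and all function names are hypothetical.

```python
import random

def make_simple_tabulation(c, char_bits, m, rng):
    """h(x) = T_1[x_1] xor ... xor T_c[x_c], with m a power of two."""
    tables = [[rng.randrange(m) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1
    def h(x):
        code = 0
        for i, table in enumerate(tables):
            code ^= table[(x >> (i * char_bits)) & mask]
        return code
    return h

def insert(table, h, key):
    """Linear probing insert; returns the slot that receives the key."""
    pos = h(key)
    while table[pos] is not None:
        pos = (pos + 1) % len(table)
    table[pos] = key
    return pos

def run_length(table, h, q):
    """R(q, S): filled positions from h(q) up to the nearest empty slot."""
    r = 0
    while table[(h(q) + r) % len(table)] is not None:
        r += 1
    return r
```

With this sketch, inserting a key $q \notin S$ inspects exactly `run_length(table, h, q) + 1` positions and places $q$ at offset $R$ from $h(q)$.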

Aiming for upper bounds on $R(q, S)$, it is simpler to study the symmetric length $L(q, S)$ of the longest filled interval containing $h(q)$. Trivially $R(q, S) \le L(q, S)$. We have $n = |S|$ keys hashed into $m$ positions. We defined the fill $\alpha = n/m$ and $\varepsilon = 1 - \alpha$. The following theorem considers the case of general relative deviations $\delta$. To bound $\Pr[L(q, S) \ge \ell]$, we can apply it with $p = h(q)$ and $\delta = \varepsilon$ or $(1 + \delta) = 1/\alpha$.

Theorem 8. Consider hashing a set of $n$ keys into $\{0, \ldots, m - 1\}$ using simple tabulation (so $m$ is a power of two). Define the fill $\alpha = n/m$. Let $p$ be any point which may or may not be a function of the hash value of a specific query key not in the set. Let $\mathcal{D}_{\ell,\delta,p}$ be the event that there exists an interval $I$ containing $p$ and of length at least $\ell$ such that the number of keys $X_I$ in $I$ has relative deviation at least $\delta$ from the mean, that is, $|X_I - \alpha|I|| \ge \delta\alpha|I|$. Suppose $\alpha\ell \le n^{1/(3c)}$, or equivalently, $m/\ell \ge n^{1-1/(3c)}$. Then for any constant $\gamma$,

$$\Pr[\mathcal{D}_{\ell,\delta,p}] \le \begin{cases} 2e^{-\Omega(\alpha\ell\delta^2)} + (\ell/m)^\gamma & \text{if } \delta \le 1 \\ (1+\delta)^{-\Omega((1+\delta)\alpha\ell)} + (\ell/m)^\gamma & \text{if } \delta = \Omega(1) \end{cases} \qquad (8)$$

Moreover, with probability $1 - n^{-\gamma}$, for every interval $I$, if $\alpha|I| \ge 1$, the number of keys in $I$ is

$$\alpha|I| \pm O\left(\sqrt{\alpha|I|}\,\log^c n\right). \qquad (9)$$

Theorem 8 is a very strong generalization of Theorem 1. A bin from Theorem 1 corresponds to a specific dyadic interval of length $\ell = 2^i$ (using $m' = m/2^i$ in Theorem 1). In Theorem 8 we consider every interval of length at least $\ell$ which contains a specific point, yet we get the same deviation bound modulo a change in the constants hidden in the O-notation.

To prove the bound on $\mathcal{D}_{\ell,\delta,p}$, we first consider the weaker event $\mathcal{C}_{i,\delta,p}$ for integer $i$ that there exists an interval $I \ni p$, $2^i \le |I| < 2^{i+1}$, such that the relative deviation in the number of keys $X_I$ is at least $\delta$. As a start we will prove that the bound from (8) holds for $\mathcal{C}_{i,\delta,p}$. Essentially Theorem 8 will follow because the probability bounds decrease exponentially in $i$.

When bounding the probability of $\mathcal{C}_{i,\delta,p}$, we will consider any $i$ such that $\alpha 2^i \le n^{1/(2c)}$, whereas in Theorem 8 we only considered $\alpha\ell \le n^{1/(3c)}$. The constraint $\alpha 2^i \le n^{1/(2c)}$ matches that in Theorem 1 with $m' = m/2^i$. In Theorem 1 we required $m' \ge n^{1-1/(2c)} \iff n/m' = \alpha 2^i \le n^{1/(2c)}$.

Our proof is based on decompositions of intervals into dyadic intervals. To simplify the terminology and avoid confusion between intervals and dyadic intervals, we let a bin on level $j$, or for short, a $j$-bin, denote a dyadic interval of length $2^j$. The expected number of keys in a $j$-bin is $\mu_j = \alpha 2^j$. The $j$-bins correspond to the bins in Theorem 1 with $m' = m/2^j$. For any $j \le i$, consider the $j$-bin containing $p$, and the $2^{i+1-j}$ $j$-bins on either side. We say that these $2^{i+2-j} + 1$ consecutive $j$-bins are relevant to $\mathcal{C}_{i,\delta,p}$, noting that they cover any $I \ni p$, $|I| < 2^{i+1}$.
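
The definition of relevant bins can be illustrated with a short Python sketch. The function name `relevant_bins` is hypothetical, bins are identified by their index (bin $t$ on level $j$ is $[t2^j, (t+1)2^j)$), and for simplicity the sketch ignores the boundaries of $\{0, \ldots, m-1\}$.

```python
def relevant_bins(p, i, j):
    """Indices of the j-bins relevant to C_{i,delta,p}: the j-bin that
    contains p together with 2^(i+1-j) j-bins on either side."""
    center = p >> j                 # index of the j-bin containing p
    side = 1 << (i + 1 - j)         # number of j-bins on each side
    return range(center - side, center + side + 1)
```

The returned range has $2^{i+2-j} + 1$ elements, and its union covers every interval containing $p$ of length less than $2^{i+1}$.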

$\delta = \Omega(1)$. To handle $\delta = \Omega(1)$, we will use the following combinatorial claim that holds for any $\delta$.

Claim 9. Let $j$ be maximal such that $2^j < \frac{\delta}{1+\delta/2}\,2^{i-2}$. If $\mathcal{C}_{i,\delta,p}$ happens, then one of the relevant $j$-bins contains more than $(1 + \delta/2)\alpha 2^j$ keys.

Proof. Assume that all the relevant $j$-bins have relative deviation at most $\delta/2$. Let $I$ be an interval witnessing $\mathcal{C}_{i,\delta,p}$, that is, $p \in I$, $2^i \le |I| < 2^{i+1}$, and the number of keys in $I$ deviates by $\delta\alpha|I|$ from the mean $\alpha|I|$. The interval $I$ contains some number of the $j$-bins, and properly intersects at most two in the ends. The relative deviation within $I$ is $\delta$, but the $j$-bins have only half this relative deviation. This means that the $j$-bins contained in $I$ can contribute at most half the deviation. The remaining $\frac{\delta}{2}\alpha|I|$ has to come from the two $j$-bins intersected in the ends. Those could contribute all or none of their keys to the deviation (e.g. all keys are on the last/first position of the interval). However, together they have at most $2(1 + \delta/2)\alpha 2^j < \delta\alpha 2^{i-1} \le \frac{\delta}{2}\alpha|I|$ keys. □

Let $\delta = \Omega(1)$ and define $j$ as in Claim 9. Then $j = i - O(1)$. To bound the probability of $\mathcal{C}_{i,\delta,p}$ it suffices to bound the probability that some of the $2^{i+2-j} + 1 = O(1)$ relevant $j$-bins has relative deviation beyond $\delta' = \delta/2$. We will apply Theorem 1 (2) with $m' = m/2^j$ and $\mu' = \alpha 2^j$ to each of these $j$-bins. Checking the conditions of Theorem 1, we note that the $k$'th relevant $j$-bin can be specified as a function of $p$ which again may be a function of the hash of the query. Also, as noted above, $m' \ge m/2^i \ge n/(\alpha 2^i) \ge n^{1-1/(2c)}$. From (2) we get that

$$\Pr[\mathcal{C}_{i,\delta,p}] = O(1)\left((1+\delta/2)^{-\Omega((1+\delta/2)\alpha 2^j)} + (2^j/m)^\gamma\right) = (1+\delta)^{-\Omega((1+\delta)\alpha 2^i)} + O\left((2^i/m)^\gamma\right).$$

$\delta \le 1$. We now consider the case $\delta \le 1$. In particular, this covers the case $\delta = o(1)$ which was not covered above. The issue is that if we apply Claim 9, we could get $j = i - \omega(1)$, hence $\omega(1)$ relevant $j$-bins, and then applying the union bound would lead to a loss. To circumvent the problem we will consider a tight decomposition involving bins on many levels below $i$ but with bigger deviations on lower levels. For any level $j \le i$, we say that a $j$-bin is "dangerous" for level $i$ if it has deviation at least:

$$\Delta_{j,i} = \frac{\delta\alpha 2^i}{24} \Big/ 2^{(i-j)/5} = \frac{\delta\alpha}{24}\,2^{\frac{4}{5}i + \frac{1}{5}j}.$$

Claim 10. Let $j_0$ be the smallest non-negative integer satisfying $\Delta_{j_0,i} \le \alpha 2^{j_0}$. If $\mathcal{C}_{i,\delta,p}$ happens, then for some $j \in \{j_0, \ldots, i\}$, there is a relevant $j$-bin which is dangerous for level $i$.

Proof. Witnessing $\mathcal{C}_{i,\delta,p}$, let $I \ni p$, $2^i \le |I| < 2^{i+1}$, have at least $(1 + \delta)\alpha|I|$ keys. First we make the standard dyadic decomposition of $I$ into maximal level bins: at most two $j$-bins on each level $j = 0, \ldots, i$. For technical reasons, if $j_0 > 0$, the decomposition is "rounded to level $j_0$". Formally, the decomposition rounded to level $j_0$ is obtained by discarding all the bins on levels below $j_0$, and including one $j_0$-bin on both sides (each covering the discarded bins on lower levels). Note that all the level bins in the decomposition of $I$ are relevant to $\mathcal{C}_{i,\delta,p}$.
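
The standard dyadic decomposition referred to here can be sketched as follows. This Python fragment is only an illustration with a hypothetical function name; it handles an integer interval $[a, b)$ with $0 \le a < b$ and does not perform the rounding to level $j_0$.

```python
def dyadic_decomposition(a, b):
    """Split [a, b) into maximal dyadic intervals, returned left to right
    as (level, index) pairs meaning [index * 2^level, (index + 1) * 2^level)."""
    bins = []
    while a < b:
        if a == 0:
            level = (b - a).bit_length()          # upper bound, reduced below
        else:
            level = (a & -a).bit_length() - 1     # largest level aligned at a
        while a + (1 << level) > b:               # shrink until the bin fits
            level -= 1
        bins.append((level, a >> level))
        a += 1 << level
    return bins
```

For example, $[3, 13)$ decomposes into $[3,4)$, $[4,8)$, $[8,12)$, $[12,13)$, which uses each of the levels 0 and 2 twice, in line with the bound of at most two bins per level.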

Assume for a contradiction that no relevant bin on levels $j_0, \ldots, i$ is dangerous for level $i$. In particular, this includes all the level bins from our decomposition. We will sum their deviations, and show that $I$ cannot have the required deviation. In case of rounding, all keys in the two rounding $j_0$-bins can potentially be in or out of $I$ (all keys in such intervals can hash to the beginning/end), contributing at most $\Delta_{j_0,i} + \alpha 2^{j_0}$ keys to the deviation in $I$. By choice of $j_0$, we have $\alpha 2^{j_0-1} < \Delta_{j_0-1,i}$. It follows that the total contribution from the rounding bins is at most

$$2(\Delta_{j_0,i} + \alpha 2^{j_0}) \le 2(\Delta_{j_0,i} + 2\Delta_{j_0-1,i}) \le 6\Delta_{i,i} = \frac{\delta\alpha 2^i}{4}.$$

The other bins from the decomposition are internal to $I$. This includes discarded ones in case of rounding. A $j$-bin contributes at most $\Delta_{j,i}$ to the deviation in $I$, and there are at most 2 such $j$-bins for each $j$. The combined internal contribution is therefore bounded by

$$2\sum_{j=0}^{i} \Delta_{j,i} = 2\sum_{j=0}^{i} \frac{\delta\alpha 2^i}{24} \Big/ 2^{(i-j)/5} = \frac{\delta\alpha 2^i}{12}\sum_{h=0}^{i} 1/2^{h/5} < \frac{\delta\alpha 2^i}{12} \Big/ (1 - 2^{-1/5}) < \frac{7.73\,\delta\alpha 2^i}{12}. \qquad (10)$$

The total deviation is thus at most $\left(\frac{1}{4} + \frac{7.73}{12}\right)\delta\alpha 2^i < \delta\alpha 2^i$, contradicting that $I$ had deviation $\delta\alpha 2^i$. □

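For each $j = j_0, \ldots, i$, we bound the probability that there exists a relevant $j$-bin which is dangerous for level $i$. There are $2^{i+2-j} + 1$ such intervals. We have mean $\mu_j = \alpha 2^j$ and deviation $\delta_{j,i}\mu_j = \Delta_{j,i} = \Theta(\delta\alpha 2^{\frac{4}{5}i + \frac{1}{5}j})$. Therefore $\delta_{j,i} = \Delta_{j,i}/\mu_j = \Theta(\delta 2^{\frac{4}{5}(i-j)})$. Note that $\delta_{j,i} \le 1$ by choice of $j_0$. We can therefore apply (1) from Theorem 1. Hence, for any constant $\gamma$, the probability that there exists a relevant $j$-bin which is dangerous for $i$ is bounded by

$$(2^{i+2-j} + 1)\left(2e^{-\Omega(\mu_j\delta_{j,i}^2)} + (2^j/m)^\gamma\right) \le O(2^{i-j})\left(e^{-\Omega(\alpha 2^i\delta^2 2^{\frac{3}{5}(i-j)})} + (2^j/m)^\gamma\right)$$

$$= O\left(2^{i-j}e^{-\Omega(\alpha 2^i\delta^2 2^{\frac{3}{5}(i-j)})} + \frac{(2^i/m)^\gamma}{2^{(i-j)(\gamma-1)}}\right).$$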



To bound the probability of $\mathcal{C}_{i,\delta,p}$, we sum the above bound for $j = j_0, \ldots, i$. We will argue that the case $j = i$ dominates. If $\gamma > 2$, then clearly this is the case for the term $O\left((2^i/m)^\gamma / 2^{(i-j)(\gamma-1)}\right)$. It remains to argue that

$$\sum_{h=0}^{i-j_0} O\left(2^h e^{-\Omega(\alpha 2^i\delta^2 2^{\frac{3}{5}h})}\right) = O\left(e^{-\Omega(\alpha 2^i\delta^2)}\right). \qquad (11)$$

At first this may seem obvious since the increase with $h$ is exponential while the decrease is doubly exponential. The statement is, however, not true if $\alpha 2^i\delta^2 = o(1)$, for $e^{-\Omega(\alpha 2^i\delta^2 2^{\frac{3}{5}h})} \approx 1$ as long as $\alpha 2^i\delta^2 2^{\frac{3}{5}h} = o(1)$. We need to argue that $\alpha 2^i\delta^2 = \Omega(1)$. Then for $h = \omega(1)$, the bound will decrease super-exponentially in $h$. Recall that our final goal for $\delta \le 1$ is to prove that $\Pr[\mathcal{C}_{i,\delta,p}] \le 2\exp(-\Omega(\alpha 2^i\delta^2)) + O((2^i/m)^\gamma)$. This statement is trivially true if $\exp(-\Omega(\alpha 2^i\delta^2)) \ge 1/2$. Thus we may assume $\exp(-\Omega(\alpha 2^i\delta^2)) < 1/2$ and this implies $\alpha 2^i\delta^2 = \Omega(1)$, as desired. Therefore the sum in (11) is dominated by the case $h = (i - j) = 0$. Summing up, for any constant $\gamma > 2$ and $\delta \le 1$, we have proved that

$$\Pr[\mathcal{C}_{i,\delta,p}] = \sum_{h=0}^{i-j_0} O\left(2^h e^{-\Omega(\alpha 2^i\delta^2 2^{\frac{3}{5}h})} + \frac{(2^i/m)^\gamma}{2^{h(\gamma-1)}}\right) = O\left(e^{-\Omega(\alpha 2^i\delta^2)} + (2^i/m)^\gamma\right) = 2e^{-\Omega(\alpha 2^i\delta^2)} + O\left((2^i/m)^\gamma\right).$$

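The constraint $\gamma > 2$ has no effect, since we get better bounds with larger $\gamma$ as long as $\gamma$ remains constant. All together, for $\alpha 2^i \le n^{1/(2c)}$, or equivalently, $m/2^i \ge n^{1-1/(2c)}$, we have proved

$$\Pr[\mathcal{C}_{i,\delta,p}] \le \begin{cases} 2e^{-\Omega(\alpha 2^i\delta^2)} + O\left((2^i/m)^\gamma\right) & \text{if } \delta \le 1 \\ (1+\delta)^{-\Omega((1+\delta)\alpha 2^i)} + O\left((2^i/m)^\gamma\right) & \text{if } \delta = \Omega(1) \end{cases} \qquad (12)$$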



We now want to bound $\Pr[\mathcal{D}_{\ell,\delta,p}]$ as in (8). For our asymptotic bound, it suffices to consider cases where $\ell = 2^k$ is a power of two. Essentially we will use the trivial bound $\Pr[\mathcal{D}_{2^k,\delta,p}] \le \sum_{h \ge 0} \Pr[\mathcal{C}_{k+h,\delta,p}]$. First we want to argue that the terms $e^{-\Omega(\alpha 2^{k+h}\delta^2)} = \left(e^{-\Omega(\alpha 2^k\delta^2)}\right)^{2^h}$ for $\delta \le 1$, and $(1+\delta)^{-\Omega((1+\delta)\alpha 2^{k+h})} = \left((1+\delta)^{-\Omega((1+\delta)\alpha 2^k)}\right)^{2^h}$ for $\delta = \Omega(1)$, are dominated by the case $h = 0$. Both terms are of the form $1/a^{2^h}$ and we want to show that $a = 1 + \Omega(1)$. For the case $\delta \le 1$, we can use the same trick as before: to prove (8) it suffices to consider $\exp(-\Omega(\alpha 2^k\delta^2)) < 1/2$, which implies $\alpha 2^k\delta^2 = \Omega(1)$ and $e^{\Omega(\alpha 2^k\delta^2)} = 1 + \Omega(1)$. When it comes to $(1+\delta)^{-\Omega((1+\delta)\alpha 2^{k+h})}$, we have $\delta = \Omega(1)$. Moreover, to get the strongest probability bound on $\Pr[\mathcal{D}_{2^k,\delta,p}]$, we can assume $(1+\delta)\alpha 2^k \ge 1$. More precisely, suppose $(1+\delta)\alpha 2^k < 1$ and define $\delta' > \delta$ such that $(1+\delta')\alpha 2^k = 1$. If an interval is non-empty, it has at least 1 key, so $\mathcal{D}_{2^k,\delta,p} \iff \mathcal{D}_{2^k,\delta',p}$, and the probability bound from (8) is better with the larger $\delta'$. Thus we can assume $(1+\delta)^{\Omega((1+\delta)\alpha 2^k)} = 1 + \Omega(1)$. We have now established that for any relevant $\delta$, the bound from (12) is of the form

$$1/a^{2^h} + (2^{k+h}/m)^\gamma \quad \text{where } a = 1 + \Omega(1).$$

As desired the first term is dominated by the smallest $h = 0$. Two issues remain: the second term is dominated by larger $h$, and (12) only applies when $m/2^{k+h} \ge n^{1-1/(2c)}$. Define $\bar h$ as the smallest value such that $a^{2^{\bar h}} \ge m/2^k$. We have $2^{\bar h} = \lceil \log_a(m/2^k) \rceil = O(\log(m/2^k))$ and the condition in Theorem 8 is that $n^{1-1/(3c)} \le m/2^k$, so $m/2^{k+\bar h} = (m/2^k)/O(\log(m/2^k)) = \tilde\Omega(n^{1-1/(3c)}) > n^{1-1/(2c)}$. We conclude that (12) applies for any $h \le \bar h$.

To handle $h \ge \bar h$ we consider the more general event $\mathcal{A}_{i,\delta}$ that the contribution to any interval of length at least $2^i$ has relative deviation at least $\delta$. It is easy to see that

$$\Pr[\mathcal{A}_{i,\delta}] \le m/2^i \cdot \Pr[\mathcal{C}_{i,\delta,p}]. \qquad (13)$$

More precisely, consider the $m/2^i$ points $p$ that are multiples of $2^i$. Any interval $I$ of length $\ge 2^i$ can be partitioned into intervals $I_j$ such that $2^i \le |I_j| < 2^{i+1}$ and $p_j = j2^i \in I_j$. If $I$ has relative deviation $\delta$, then so does some $I_j$, and then $\mathcal{C}_{i,\delta,p_j}$ is satisfied. Thus (13) follows. With our particular value of $i = k + \bar h$, for any $\gamma > 1$, we get

Pr[A k+htS ] < m/2 k+ ~ h Pr[C k+hAp ] = (l// + 

< m/2 k + h (2 k /m + (2 k+ ~ h /m)^ 

= {2 k+ ~ h+l / m y- 1 = n^/my- 1 

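Finally we are ready to compute $\Pr[\mathcal{D}_{2^k,\delta,p}] \le \sum_{h=0}^{\bar h - 1} \Pr[\mathcal{C}_{k+h,\delta,p}] + \Pr[\mathcal{A}_{k+\bar h,\delta}]$. In $\Pr[\mathcal{C}_{k+h,\delta,p}] = 1/a^{2^h} + (2^{k+h}/m)^\gamma$ the terms $1/a^{2^h}$ were dominated by $h = 0$, and the terms $(2^{k+h}/m)^\gamma$ are dominated by $h = \bar h$, which is covered by $\Pr[\mathcal{A}_{k+\bar h,\delta}]$. We conclude that

$$\Pr[\mathcal{D}_{2^k,\delta,p}] \le \begin{cases} 2e^{-\Omega(\alpha 2^k\delta^2)} + \tilde O\left((2^k/m)^{\gamma-1}\right) & \text{if } \delta \le 1 \\ (1+\delta)^{-\Omega((1+\delta)\alpha 2^k)} + \tilde O\left((2^k/m)^{\gamma-1}\right) & \text{if } \delta = \Omega(1) \end{cases}$$

Since $\gamma$ can always be picked larger, this completes the proof of (8) in Theorem 8.

3.1 The cost of linear probing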

We now return to the costs of the different operations with linear probing and simple tabulation hashing. We have stored a set $S$ of $n$ keys in a table of size $m$. Define the fill $\alpha = n/m$ and $\varepsilon = 1 - \alpha$. For any key $q$ we let $R = R(q, S)$ be the number of filled positions from the hash location of $q$ to the nearest empty slot. For insertions and unsuccessful searches, we have $q \notin S$, and then the number of cells probed is exactly $R(q, S) + 1$. This also expresses the number of probes when we delete, but in deletions, we have $q \in S$. Finally, in a successful search, the number of probes is bounded by $R(q, S \setminus \{q\}) + 1$. From Theorem 8 we get tail bounds on $R(q, S)$ including the case where $q \in S$:

Corollary 11. For any $\gamma = O(1)$ and $\ell \le n^{1/(3c)}/\alpha$,

$$\Pr[R(q, S) \ge \ell] \le \begin{cases} 2e^{-\Omega(\ell\varepsilon^2)} + (\ell/m)^\gamma & \text{if } \alpha \ge 1/2 \\ \alpha^{\Omega(\ell)} + (\ell/m)^\gamma & \text{if } \alpha \le 1/2 \end{cases} \qquad (14)$$

Proof. When $q \notin S$, we simply apply (8) from Theorem 8 with $p = h(q)$. If $\varepsilon \le 1/2$, we use $\delta = \varepsilon$, and if $\alpha \le 1/2$, we use $(1 + \delta) = 1/\alpha$ implying $\delta \ge 1/2$. In fact, we can do almost the same if $q \in S$. We will only apply Theorem 8 to $S' = S \setminus \{q\}$ which has fill $\alpha' < \alpha$. If $\alpha \ge 1/2$ we note that (14) does not provide a non-trivial bound if $\ell = O(1/\varepsilon^2)$, so we can easily assume $\ell \ge 2/\varepsilon$. For any interval of this length to be filled by $S$, the contribution from $S'$ has to have a relative deviation of at least $\varepsilon/2$. For $\alpha < 1/2$, we note that we can assume $\ell \ge 4$, and we choose $\delta$ such that $(1 + 2\delta) = 1/\alpha$. Since $1/\alpha > 2$, we have $(1 + \delta) \le (3/4)/\alpha$. For an interval of length $\ell$ to be full, it needs $(1 + 2\delta)\alpha\ell \ge 1 + (1 + \delta)\alpha\ell$ keys from $S$, so it needs at least $(1 + \delta)\alpha\ell$ keys from $S'$. Now (14) follows from (8) since $(1 + \delta) \ge \sqrt{1 + 2\delta} = \sqrt{1/\alpha}$. □

From Corollary 11 it follows that for $\alpha \ge 1/2$ we get a tight concentration of $R(q, S)$ around $\Theta(1/\varepsilon^2)$, e.g., for any moment $p = O(1)$, $\mathrm{E}[R(q, S)^p] = O(1/\varepsilon^{2p})$.

Now consider smaller fills $\alpha \le 1/2$. Corollary 11 does not offer strong bounds on $\Pr[R(q, S) > 0]$. It works better when $R(q, S)$ exceeds some large enough constant. However, in Section 6 we show that simple tabulation satisfies a certain 4th moment bound, and in Section 6.4 (26), we show that when $q \notin S$, this implies that linear probing fills a location depending on $h(q)$ with probability $O(\alpha)$. Thus we add to Corollary 11 that for $q \notin S$,

$$\Pr[R(q, S) > 0] = O(\alpha). \qquad (15)$$

Combining this with the exponential drop for larger $R(q, S)$ in Corollary 11, it follows for any constant moment $p$ that $\mathrm{E}[R(q, S)^p] = O(\alpha)$ when $q \notin S$.

Now consider $q \in S$ as in deletions. The probability that $S' = S \setminus \{q\}$ fills either $h(q)$ or $h(q) + 1$ is $O(\alpha)$. Otherwise $S$ fills $h(q)$ leaving $h(q) + 1$ empty, and then $R(q, S) = 1$. Therefore, for $q \in S$,

$$\Pr[R(q, S) > 1] = O(\alpha). \qquad (16)$$

Combining this with the exponential drop for larger $R(q, S)$ in Corollary 11, it follows for any constant moment $p$ that $\mathrm{E}[R(q, S)^p] = 1 + O(\alpha)$ when $q \in S$.



3.2 Larger intervals 

To finish the proof of Theorem 8, we need to consider the case of larger intervals. We want to show that, with probability $1 - n^{-\gamma}$ for any $\gamma = O(1)$, for every interval $I$ where the mean number of keys is $\alpha|I| \ge 1$, the deviation is at most

$$O\left(\sqrt{\alpha|I|}\,\log^c n\right).$$

Consider an interval $I$ with $\alpha|I| \ge 1$. As in the proof of Claim 10, we consider a maximal dyadic decomposition into level bins with up to two $j$-bins for each $j \le i = \lfloor \log_2 |I| \rfloor$. Let $j_0 = \lceil \log_2(1/\alpha) \rceil$. Again we round to level $j_0$, discarding the lower level bins, but adding a $j_0$-bin on either side. The deviation in $I$ is bounded by the total deviation of the internal bins plus the total contents of the side bins. The expected number of keys in each side bin is $\alpha 2^{j_0} \le 2$.

For each $j \in \{j_0, \ldots, i\}$, we apply Theorem 1 with $m' = m/2^j \le \alpha m = n$ bins. W.h.p., the maximal deviation for any $j$-bin is $O\left(\sqrt{n/m'}\,\log^c n\right) = O\left(\sqrt{\alpha 2^j}\,\log^c n\right)$. This gives a total deviation of at most

$$2\left(2 + O\left(\sqrt{\alpha 2^{j_0}}\,\log^c n\right) + \sum_{j=j_0}^{i} O\left(\sqrt{\alpha 2^j}\,\log^c n\right)\right) = O\left(\sqrt{\alpha|I|}\,\log^c n\right),$$

as desired. For each $j$ there is an error probability of $n^{-\gamma'}$ for any $\gamma' = O(1)$. The error probability over all $j \in \{j_0, \ldots, i\}$ is $(i - j_0 + 1)n^{-\gamma'}$. Here $i - j_0 \le \log_2 m - \log_2(1/\alpha) = \log_2 m - \log_2(m/n) = \log_2 n$, so $(i - j_0 + 1)n^{-\gamma'} \le n^{-\gamma'}(1 + \log n)$. This completes the proof of Theorem 8.

3.3 Set estimation

We can easily apply our results for set estimation where one saves a bottom-$k$ sketch. More precisely, suppose that for a set $A$ we store a sample $S$ consisting of the $k$ keys with the smallest hash values. Consider now some subset $B \subseteq A$. We then use $|B \cap S|/k$ as an estimator for $|B|/|A|$. We can use the above bounds to bound the probability that this estimator is wrong by more than a factor $\frac{1+\delta}{1-\delta}$. Let $\tau$ be the $k$th hash value of $A$. First we use the bounds to argue that $\tau = (1 \pm \delta)k/|A|$. Next we use them to argue that the number of elements from $B$ below any given $\tau'$ is $(1 \pm \delta)\tau'|B|$. Applying this with $\tau' = (1 - \delta)k/|A|, (1 + \delta)k/|A|$, we get the desired bound.
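
The estimator described above can be sketched in a few lines of Python. This is a hedged illustration rather than a reference implementation: keys are assumed to be 32-bit integers split into four 8-bit characters, ties in hash values are broken by key, and the function names are hypothetical.

```python
import random

def make_simple_tabulation(c, char_bits, out_bits, rng):
    """h(x) = T_1[x_1] xor ... xor T_c[x_c] with out_bits-bit random codes."""
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1
    def h(x):
        code = 0
        for i, table in enumerate(tables):
            code ^= table[(x >> (i * char_bits)) & mask]
        return code
    return h

def bottom_k(keys, h, k):
    """The sample S: the k keys of A with the smallest hash values."""
    return set(sorted(keys, key=lambda x: (h(x), x))[:k])

def estimate_fraction(B, sketch):
    """The estimator |B intersect S| / k for |B| / |A|."""
    return sum(1 for x in sketch if x in B) / len(sketch)
```

A typical use is to build `S = bottom_k(A, h, k)` once and then call `estimate_fraction(B, S)` for any subset `B` of `A` given as a Python set.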

4 Cuckoo Hashing 

We are now going to analyze cuckoo hashing. In our analysis of chaining and linear probing, we did not worry so much about constants, but with cuckoo hashing, we do have to worry about obstructions that could stem from the hashing of just a constant number of keys, e.g., as an extreme case we could have three keys sharing the same two hash locations. It is, in fact, a constant-sized obstruction that provides the negative side of our result:

Observation 12. There exists a set $S$ of $n$ keys such that cuckoo hashing with simple tabulation hashing cannot place $S$ into two tables of size $2n$ with probability $\Omega(n^{-1/3})$.

Proof. The hard instance is the 3-dimensional cube $[n^{1/3}]^3$. Here is a sufficient condition for cuckoo hashing to fail:

1. there exist $a, b, c \in [n^{1/3}]^2$ with $h_0(a) = h_0(b) = h_0(c)$;

2. there exist $x, y \in [n^{1/3}]$ with $h_1(x) = h_1(y)$.

If both happen, then the elements $ax, ay, bx, by, cx, cy$ cannot be hashed. Indeed, on the left side $h_0(a) = h_0(b) = h_0(c)$ so they only occupy 2 positions. On the right side, $h_1(x) = h_1(y)$ so they only occupy 3 positions. In total they occupy $5 < 6$ positions.

The probability of 1. is asymptotically $(n^{2/3})^3/n^2 = \Omega(1)$. This is because tabulation (on two characters) is 3-independent. The probability of 2. is asymptotically $(n^{1/3})^2/n = \Omega(1/n^{1/3})$. So overall cuckoo hashing fails with probability $\Omega(n^{-1/3})$. □

Our positive result will effectively show that this is the worst possible instance: for any set $S$, the failure probability is $O(n^{-1/3})$.

The proof is an encoding argument. A tabulation hash function from $\Sigma^c \mapsto [m]$ has entropy $|\Sigma|c \lg m$ bits; we have two random functions $h_0$ and $h_1$. If, under some event $\mathcal{E}$, one can encode the two hash functions $h_0, h_1$ using $(2|\Sigma|c \lg m) - \gamma$ bits, it follows that $\Pr[\mathcal{E}] = O(2^{-\gamma})$. Letting $\mathcal{E}_S$ denote the event that cuckoo hashing fails on the set of keys $S$, we will demonstrate a saving of $\gamma = \frac{1}{3}\lg n - f(c, \varepsilon) = \frac{1}{3}\lg n - O(1)$ bits in the encoding. Note that we are analyzing simple tabulation on a fixed set of $n$ keys, so both the encoder and the decoder know $S$.

We will consider various cases, and give algorithms for encoding some subset of the hash codes (we can afford $O(1)$ bits in the beginning of the encoding to say which case we are in). At the end, the encoder will always list all the remaining hash codes in order. If the algorithm chooses to encode $k$ hash codes, it will use space at most $k \lg m - \frac{1}{3}\lg n + O(1)$ bits. That is, it will save $\frac{1}{3}\lg n - O(1)$ bits in the complete encoding of $h_0$ and $h_1$.

Figure 1: Minimal obstructions to cuckoo hashing.

4.1 An easy way out 

A subkey is a set of position-characters on distinct positions. If $a$ is a subkey, we let $C(a) = \{x \in S \mid a \subseteq x\}$ be the set of "completions" of $a$ to a valid key.

We first consider an easy way out: there are subkeys $a$ and $b$ on the same positions such that $|C(a)| \ge n^{2/3}$, $|C(b)| \ge n^{2/3}$, and $h_i(a) = h_i(b)$ for some $i \in \{0, 1\}$. Then we can easily save $\frac{1}{3}\lg n - O(1)$ bits. First we write the set of positions of $a$ and $b$, and the side of the collision ($c + 1$ bits). There are at most $n^{1/3}$ subkeys on those positions that have $\ge n^{2/3}$ completions each, so we can write the identities of $a$ and $b$ using $\frac{1}{3}\lg n$ bits each. We write the hash codes $h_i$ for all characters in $a \Delta b$ (the symmetric difference of $a$ and $b$), skipping the last one, since it can be deduced from the collision. This uses $c + 1 + 2 \cdot \frac{1}{3}\lg n + (|a \Delta b| - 1)\lg m$ bits to encode $|a \Delta b|$ hash codes, so it saves $\frac{1}{3}\lg n - O(1)$ bits.

The rest of the proof assumes that there is no easy way out.

4.2 Walking Along an Obstruction 

Consider the bipartite graph with $m$ nodes on each side and $n$ edges going from $h_0(x)$ to $h_1(x)$ for all $x \in S$. Remember that cuckoo hashing succeeds if and only if no component in this graph has more edges than nodes. Assuming cuckoo hashing failed, the encoder can find a subgraph with one of two possible obstructions: (1) a cycle with a chord; or (2) two cycles connected by a path (possibly a trivial path, i.e. the cycles simply share a vertex).
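
The success criterion just stated (no component with more edges than nodes) is easy to test directly. The Python sketch below is an illustration with hypothetical names: each key is given as a pair (left position, right position), and a union-find structure groups the positions into components.

```python
def cuckoo_placeable(edges):
    """True iff no connected component of the bipartite cuckoo graph
    has more edges (keys) than nodes (table positions)."""
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for left, right in edges:
        u, w = ("L", left), ("R", right)
        parent.setdefault(u, u)
        parent.setdefault(w, w)
        ru, rw = find(u), find(w)
        if ru != rw:
            parent[ru] = rw

    nodes, keys = {}, {}
    for v in list(parent):
        root = find(v)
        nodes[root] = nodes.get(root, 0) + 1
    for left, _ in edges:
        root = find(("L", left))
        keys[root] = keys.get(root, 0) + 1
    return all(keys.get(root, 0) <= count for root, count in nodes.items())
```

For the six keys in the proof of Observation 12, which occupy 2 left positions and 3 right positions, the function returns `False`, while a single cycle of four keys on four positions is still placeable.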

Let $v_0$ be a node of degree 3 in such an obstruction, and let its incident edges be $a_0, a_1, a_2$. The obstruction can be traversed by a walk that leaves $v_0$ on edge $a_0$, returns to $v_0$ on edge $a_1$, leaves again on $a_2$, and eventually meets itself. Other than visiting $v_0$ and the last node twice, no node or edge is repeated. See Figure 1.

Let $x_1, x_2, \ldots$ be the sequence of keys in the walk. The first key is $x_1 = a_0$. Technically, when
the walk meets itself at the end, it is convenient to expand it with an extra key, namely the one it
first used to get to the meeting point. This repeated key marks the end of the original walk, and
we chose it so that it is not identical to the last original key. Let $x_{<i} = \bigcup_{j<i} x_j$ be the position-characters
seen in keys before $x_i$. Define $\hat{x}_i = x_i \setminus x_{<i}$ to be the position-characters of $x_i$ not seen
previously in the sequence. Let $k$ be the first position such that $\hat{x}_{k+1} = \emptyset$. Such a $k$ certainly exists,
since the last key in our walk is a repeated key.
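
To make the definitions concrete, the sets of fresh position-characters and the stopping index $k$ can be computed in one pass over the walk. This is an illustrative sketch only; the function names `fresh_characters` and `first_k` and the tuple representation of keys are our assumptions.

```python
from typing import List, Optional, Set, Tuple

PosChar = Tuple[int, str]   # (position, character)


def fresh_characters(walk: List[Tuple[str, ...]]) -> List[Set[PosChar]]:
    """For each key x_i of the walk, return the position-characters not seen in x_1..x_{i-1}."""
    seen: Set[PosChar] = set()
    fresh = []
    for key in walk:
        chars = set(enumerate(key))
        fresh.append(chars - seen)
        seen |= chars
    return fresh


def first_k(walk: List[Tuple[str, ...]]) -> Optional[int]:
    """Smallest k (1-indexed) such that key x_{k+1} reveals no new position-character."""
    for idx, new in enumerate(fresh_characters(walk)):
        if not new:
            return idx   # idx is the 0-based index of x_{k+1}, which equals k
    return None


walk = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "x")]
print(fresh_characters(walk))
print(first_k(walk))
```

In this toy walk the fourth key consists only of previously seen position-characters, so the walk stops with $k = 3$.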

At a high level, the encoding algorithm will encode the hash codes of $\hat{x}_1, \ldots, \hat{x}_k$ in this order.
Note that the obstruction, hence the sequence $(x_i)$, depends on the hash functions $h_0$ and $h_1$. Thus,
the decoder does not know the sequence, and it must also be written in the encoding.

For notational convenience, let $h_i = h_{i \bmod 2}$. This means that in our sequence $x_i$ and $x_{i+1}$
collide in their $h_i$ hash code, that is $h_i(x_i) = h_i(x_{i+1})$. Formally, we define 3 subroutines:

$\textsc{Id}(x)$: Write the identity of $x \in S$ in the encoding, which takes $\lg n$ bits.

$\textsc{Hashes}(h_i, x_k)$: Write the hash codes $h_i$ of the characters $\hat{x}_k$. This takes $|\hat{x}_k| \lg m$ bits.

$\textsc{Coll}(x_i, x_{i+1})$: Document the collision $h_i(x_i) = h_i(x_{i+1})$. We write all $h_i$ hash codes of characters
$\hat{x}_i \cup \hat{x}_{i+1}$ in some fixed order. The last hash code of $\hat{x}_i \,\Delta\, \hat{x}_{i+1}$ is redundant and will be
omitted. Indeed, the decoder can compute this last hash code from the equality $h_i(x_i) = h_i(x_{i+1})$.
Since $\hat{x}_{i+1} = x_{i+1} \setminus x_{<i+1}$, we have $\hat{x}_{i+1} \setminus \hat{x}_i \ne \emptyset$, so there exists a hash code in $\hat{x}_i \,\Delta\, \hat{x}_{i+1}$. This
subroutine uses $(|\hat{x}_i \cup \hat{x}_{i+1}| - 1) \lg m$ bits, saving $\lg m$ bits compared to the trivial alternative:
$\textsc{Hashes}(h_i, x_i)$; $\textsc{Hashes}(h_i, x_{i+1})$.

To decode the above information, the decoder will need enough context to synchronize with the
coding stream. For instance, to decode $\textsc{Coll}(x_i, x_{i+1})$, one typically needs to know $i$, and the
identities of $x_i$ and $x_{i+1}$.

Our encoding begins with the value $k$, encoded with $O(\lg k)$ bits, which allows the decoder to
know when to stop. The encoding proceeds with the output of the stream of operations:

$\textsc{Id}(x_1)$; $\textsc{Hashes}(h_0, x_1)$; $\textsc{Id}(x_2)$; $\textsc{Coll}(x_1, x_2)$;
$\ldots$; $\textsc{Id}(x_k)$; $\textsc{Coll}(x_{k-1}, x_k)$; $\textsc{Hashes}(h_k, x_k)$

We observe that for each $i > 1$, we save $\varepsilon$ bits of entropy. Indeed, $\textsc{Id}(x_i)$ uses $\lg n$ bits, but
$\textsc{Coll}(x_{i-1}, x_i)$ then saves $\lg m = \lg((1 + \varepsilon)n) \ge \varepsilon + \lg n$ bits.

The trouble is $\textsc{Id}(x_1)$, which has an upfront cost of $\lg n$ bits. We must devise algorithms that
modify this stream of operations and save $\frac{4}{3} \lg n - O(1)$ bits, giving an overall saving of $\frac{1}{3} \lg n - O(1)$.
(For intuition, observe that a saving that ignores the cost of $\textsc{Id}(x_1)$ bounds the probability of an
obstruction at some fixed vertex in the graph. This probability must be much smaller than $1/n$,
so we can union bound over all vertices. In encoding terminology, this saving must be much more
than $\lg n$ bits.)

We will use modifications to all types of operations. For instance, we will sometimes encode
$\textsc{Id}(x)$ with much less than $\lg n$ bits. At other times, we will be able to encode $\textsc{Coll}(x_i, x_{i+1})$ with
the cost of $|\hat{x}_i \cup \hat{x}_{i+1}| - 2$ characters, saving $\lg n$ bits over the standard encoding.

Since we will make several such modifications, it is crucial to verify that they only touch distinct
operations in the stream. Each modification to the stream will be announced at the beginning of
the stream with a pointer taking $O(\lg k)$ bits. This way, the decoder knows when to apply the
special algorithms. We note that terms of $O(\lg k)$ are negligible, since we are already saving $\varepsilon k$ bits
by the basic encoding ($\varepsilon$ bits per edge). For any $k$, $O(\lg k) \le \varepsilon k + f(c, \varepsilon) = \varepsilon k + O(1)$. Thus, if our
overall saving is $\frac{1}{3} \lg n - O(\lg k) + \varepsilon k$, it achieves the stated bound of $\frac{1}{3} \lg n - O(1)$.

4.3 Safe Savings 

Remember that $\hat{x}_{k+1} = \emptyset$, which suggests that we can save a lot by local changes towards the end
of the encoding. We have $x_{k+1} \subseteq x_{<k+1}$, so $x_{k+1} \setminus x_{<k} \subseteq \hat{x}_k$. We will first treat the case when
$x_{k+1} \setminus x_{<k}$ is a proper subset of $\hat{x}_k$ (including the empty subset). This is equivalent to $\hat{x}_k \not\subseteq x_{k+1}$.

Lemma 13 (safe-strong). If $\hat{x}_k \not\subseteq x_{k+1}$, we can save $\lg n - O(c \lg k)$ bits by changing $\textsc{Hashes}(x_k)$.

Proof. We can encode $\textsc{Id}(x_{k+1})$ using $c \lg k$ extra bits, since it consists only of known characters
from $x_{<k+1}$. For each position $1 \ldots c$, it suffices to give the index of a previous $x_i$ that contained the
same position-character. Then, we will write all hash codes $h_k$ for the characters in $\hat{x}_k$, except
for some $\alpha \in \hat{x}_k \setminus x_{k+1}$. From $h_k(x_k) = h_k(x_{k+1})$, we have $h_k(\alpha) = h_k(x_k \setminus \{\alpha\}) \oplus h_k(x_{k+1})$. All
quantities on the right hand side are known (in particular $\alpha \notin x_{k+1}$), so the decoder can compute
$h_k(\alpha)$. □
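
The decoding step in this proof is just an xor of known hash codes. The sketch below illustrates it on a toy instance; the table layout (a dictionary from position-characters to 16-bit codes), the helper names `xor_hash` and `recover`, and the way the collision is forced are all our assumptions for the sake of the example.

```python
import random
from functools import reduce
from typing import Dict, Iterable, List, Tuple

PosChar = Tuple[int, str]


def xor_hash(table: Dict[PosChar, int], chars: Iterable[PosChar]) -> int:
    """Simple tabulation hash of a (sub)key given as a collection of position-characters."""
    return reduce(lambda acc, pc: acc ^ table[pc], chars, 0)


def recover(known: Dict[PosChar, int], x_k: List[PosChar],
            x_next: List[PosChar], alpha: PosChar) -> int:
    """Decoder side: h(alpha) = h(x_k minus alpha) xor h(x_{k+1}), using only known codes."""
    rest = [pc for pc in x_k if pc != alpha]
    return xor_hash(known, rest) ^ xor_hash(known, x_next)


rng = random.Random(7)
x_k = [(0, "c"), (1, "x")]       # (0, "c") plays the role of the fresh character alpha
x_next = [(0, "a"), (1, "y")]    # consists only of previously known characters
alpha = (0, "c")

table = {pc: rng.randrange(1 << 16) for pc in [(0, "a"), (1, "x"), (1, "y")]}
# Force the collision h(x_k) = h(x_{k+1}) that the encoder observed.
table[alpha] = table[(1, "x")] ^ xor_hash(table, x_next)

known = {pc: code for pc, code in table.items() if pc != alpha}
print(recover(known, x_k, x_next, alpha) == table[alpha])
```

The decoder never reads the code of `alpha`, which is why it can be omitted from the encoding.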

It remains to treat the case when the last revealed characters of $x_{k+1}$ are precisely $\hat{x}_k$: $\hat{x}_k \subseteq x_{k+1}$.
That is, both $x_k$ and $x_{k+1}$ consist of $\hat{x}_k$ and some previously known characters. In this case, the
collision $h_k(x_k) = h_k(x_{k+1})$ does not provide us any information, since it reduces to the trivial
$h_k(\hat{x}_k) = h_k(\hat{x}_k)$. Assuming that we didn't take the "easy way out", we can still guarantee a more
modest saving of $\frac{1}{3} \lg n$ bits:

Lemma 14 (safe-weak). Let $K$ be the set of position-characters known before encoding $\textsc{Id}(x_i)$, and
assume there is no easy way out. If $x_i \,\Delta\, x_{i+1} \subseteq x_{<i}$, then we can encode both $\textsc{Id}(x_i)$ and $\textsc{Id}(x_{i+1})$
using a total of $\frac{2}{3} \lg n + O(c \lg |K|)$ bits.



A typical case where we apply the lemma is $i = k$ and $K = x_{<k}$. If $\hat{x}_k \subseteq x_{k+1}$, we have
$x_k \,\Delta\, x_{k+1} \subseteq K$. Thus, we can obtain $\textsc{Id}(x_k)$ for roughly $\frac{2}{3} \lg n$ bits, which saves $\frac{1}{3} \lg n$ bits.



Proof of Lemma 14. With $O(c \lg k)$ bits, we can code the subkeys $x_i \cap x_{<i}$ and $x_{i+1} \cap x_{<i}$. It remains
to code $z = x_i \setminus x_{<i} = x_{i+1} \setminus x_{<i}$. Since $z$ is common to both keys $x_i$ and $x_{i+1}$, we have that $x_i \setminus z$
and $x_{i+1} \setminus z$ are subkeys on the same positions. With no easy way out and $h_i(x_i \setminus z) = h_i(x_{i+1} \setminus z)$,
we must have $|C(x_i \setminus z)| < n^{2/3}$ or $|C(x_{i+1} \setminus z)| < n^{2/3}$. In the former case, we code $z$ as a member
of $C(x_i \setminus z)$ with $\lceil \frac{2}{3} \lg n \rceil$ bits; otherwise we code $z$ as a member of $C(x_{i+1} \setminus z)$. □

4.4 Piggybacking 

Before moving forward, we present a general situation when we can save $\lg n$ bits by modifying a
$\textsc{Coll}(x_i, x_{i+1})$ operation:

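Lemma 15. We can save $\lg n - O(\lg k)$ bits by modifying $\textsc{Coll}(x_i, x_{i+1})$ if we have identified two
(sub)keys $e$ and $f$ satisfying:

$h_i(e) = h_i(f)$; $e \,\Delta\, f \subseteq x_{<i+1}$; $\emptyset \ne (e \,\Delta\, f) \setminus x_{<i} \ne (x_i \,\Delta\, x_{i+1}) \setminus x_{<i}$.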

Proof. In the typical encoding of $\textsc{Coll}(x_i, x_{i+1})$, we saved one redundant character from $h_i(x_i) =
h_i(x_{i+1})$, which is an equation involving $(x_i \,\Delta\, x_{i+1}) \setminus x_{<i}$ and some known characters from $x_{<i}$. The
lemma guarantees a second linearly independent equation over the characters $\hat{x}_i \cup \hat{x}_{i+1}$, so we can
save a second redundant character.

Formally, let $\alpha$ be a position-character of $(e \,\Delta\, f) \setminus x_{<i}$, and $\beta$ a position-character in $(x_i \,\Delta\, x_{i+1}) \setminus x_{<i}$
but outside $(e \,\Delta\, f) \setminus x_{<i}$. Note $\beta \ne \alpha$ and such a $\beta$ exists by assumption. We write the $h_i$ hash
codes of position characters $(\hat{x}_i \cup \hat{x}_{i+1}) \setminus \{\alpha, \beta\}$. The hash $h_i(\alpha)$ can be deduced since $\alpha$ is the last
unknown in the equality $h_i(e \setminus f) = h_i(f \setminus e)$. The hash $h_i(\beta)$ can be deduced since it is the last
unknown in the equality $h_i(x_i) = h_i(x_{i+1})$. □

While the safe saving ideas only require simple local modifications to the encoding, they achieve
a weak saving of $\frac{1}{3} \lg n$ bits for the case $\hat{x}_k \subseteq x_{k+1}$. A crucial step in our proof is to obtain a saving
of $\lg n$ bits for this case. We do this by one of the following two lemmas:

Lemma 16 (odd-side saving). Consider two edges $e, f$ and an $i < k - 2$ satisfying:
$h_{i+1}(e) = h_{i+1}(f)$; $e \setminus x_{<i} \ne f \setminus x_{<i}$; $e \setminus x_{<i+1} = f \setminus x_{<i+1}$.
We can save $\lg n - O(c \lg k)$ bits by changing $\textsc{Coll}(x_{i+1}, x_{i+2})$.



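Proof. We apply Lemma 15 with the subkeys $\tilde{e} = e \setminus f$ and $\tilde{f} = f \setminus e$. We can identify these
in $O(c \lg k)$ bits, since they only contain characters of $x_{<i+1}$. Since $e$ and $f$ have different free
characters before $x_{i+1}$, but identical free characters afterward, it must be that $\tilde{e} \cup \tilde{f} \subseteq x_{<i+1}$ but
$\tilde{e} \cup \tilde{f} \not\subseteq x_{<i}$. To show $(e \,\Delta\, f) \setminus x_{<i} \ne (x_{i+1} \,\Delta\, x_{i+2}) \setminus x_{<i}$, remark that $\hat{x}_{i+2} \ne \emptyset$ and $\hat{x}_{i+2}$ cannot have
characters of $\tilde{e} \cup \tilde{f}$. Thus, Lemma 15 applies. □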

Lemma 17 (piggybacking). Consider two edges $e, f$ and an $i < k - 1$ satisfying:

$h_i(e) = h_i(f)$; $e \setminus x_{<i} \ne f \setminus x_{<i}$; $e \setminus x_{<i+1} = f \setminus x_{<i+1}$.

We can encode $\textsc{Id}(e)$ and $\textsc{Id}(f)$ using only $O(c \lg k)$ bits, after modifications to $\textsc{Id}(x_i)$, $\textsc{Id}(x_{i+1})$,
and $\textsc{Coll}(x_i, x_{i+1})$.

The proof of this lemma is more delicate, and is given below. The difference between the two
lemmas is the parity (side in the bipartite graph) of the collision of $x_i$ and $x_{i+1}$ versus the collision
of $e$ and $f$. In the second result, we cannot actually save $\lg n$ bits, but we can encode $\textsc{Id}(e)$ and
$\textsc{Id}(f)$ almost for free: we say $e$ and $f$ piggyback on the encodings of $x_i$ and $x_{i+1}$.

Through a combination of the two lemmas, we can always achieve a saving of $\lg n$ bits in the case
$\hat{x}_k \subseteq x_{k+1}$, improving on the safe-weak bound:

Lemma 18. Assume $k$ is minimal such that $\hat{x}_k \subseteq x_{k+1}$. We can save $\lg n - O(c \lg k)$ bits if we
may modify any operations in the stream, up to those involving $x_{k+1}$.

Proof. We will choose $e = x_k$ and $f = x_{k+1}$. We have $e \setminus x_{<k} = f \setminus x_{<k} = \hat{x}_k$. On the other
hand, $e \setminus x_1 \ne f \setminus x_1$ since $x_1$ only reveals one character per position. Thus there must be some
$1 < i < k - 1$ where the transition happens: $e \setminus x_{<i} \ne f \setminus x_{<i}$ but $e \setminus x_{<i+1} = f \setminus x_{<i+1}$. If $i$ has
the opposite parity compared to $k$, Lemma 16 saves a $\lg n$ term. (Note that $i < k - 2$ as required
by the lemma.)

If $i$ has the same parity as $k$, Lemma 17 gives us $\textsc{Id}(x_k)$ at negligible cost. Then, we can
remove the operation $\textsc{Id}(x_k)$ from the stream, and save $\lg n$ bits. (Again, note that $i < k - 2$ as
required.) □

Proof of Lemma 17. The lemma assumed $e \setminus x_{<i} \ne f \setminus x_{<i}$ but $e \setminus x_{<i+1} = f \setminus x_{<i+1}$. Therefore,
$e \,\Delta\, f \subseteq x_{<i+1}$ and $(e \,\Delta\, f) \cap \hat{x}_i \ne \emptyset$. Lemma 15 applies if we furthermore have $(e \,\Delta\, f) \setminus x_{<i} \ne
(x_i \,\Delta\, x_{i+1}) \setminus x_{<i}$. If the lemma applies, we have a saving of $\lg n$, so we can afford to encode $\textsc{Id}(e)$.
Then $\textsc{Id}(f)$ can be encoded using $O(c \lg k)$ bits, since $f$ differs from $e$ only in position-characters
from $x_{<i+1}$.

If the lemma does not apply, we have a lot of structure on the keys. Let $y = \hat{x}_i \setminus (e \cup f)$ and
$g = e \setminus x_{<i+1} = f \setminus x_{<i+1}$. We must have $y \subseteq x_{i+1}$, for otherwise $\hat{x}_i \setminus x_{i+1}$ contains an element
outside $e \,\Delta\, f$ and the lemma applies. We must also have $\hat{x}_{i+1} \subseteq e \cup f$.

We can write $\textsc{Id}(x_i)$, $\textsc{Id}(x_{i+1})$, $\textsc{Id}(e)$, and $\textsc{Id}(f)$ using $2 \lg n + O(c \lg k)$ bits in total, as follows:

• the coordinates on which $y$ and $g$ appear, taking $2c$ bits.

• the value of $y$ using Huffman coding. Specifically, we consider the projection of all $n$ keys on
the coordinates of $y$. In this distribution, $y$ has frequency $\frac{C(y)}{n}$, so its Huffman code will use
$\lg \frac{n}{C(y)} + O(1)$ bits.

• the value of $g$ using Huffman coding. This uses $\lg \frac{n}{C(g)} + O(1)$ bits.

• if $C(y) \le C(g)$, we write $x_i$ and $x_{i+1}$. Each of these requires $\lceil \log_2 C(y) \rceil$ bits, since $y \subseteq x_i, x_{i+1}$
and there are $C(y)$ completions of $y$ to a full key. Using an additional $O(c \lg k)$ bits, we can
write $e \cap x_{<i+1}$ and $f \cap x_{<i+1}$. Remember that we already encoded $g = e \setminus x_{<i+1} = f \setminus x_{<i+1}$,
so the decoder can recover $e$ and $f$.

• if $C(g) < C(y)$, we write $e$ and $f$, each requiring $\lceil \log_2 C(g) \rceil$ bits. Since we know $y = \hat{x}_i \setminus (e \cup f)$,
we can write $x_i$ using $O(c \lg k)$ bits: write the old characters outside $\hat{x}_i$, and which positions
of $e \cup f$ to reuse in $\hat{x}_i$. We showed $\hat{x}_{i+1} \subseteq e \cup f$, so we can also write $x_{i+1}$ using $O(c \lg k)$ bits.

Overall, the encoding uses space: $\lg \frac{n}{C(y)} + \lg \frac{n}{C(g)} + 2 \lg \min\{C(y), C(g)\} + O(c \lg k) \le 2 \lg n +
O(c \lg k)$. □

4.5 Putting it Together 

We now show how to obtain a saving of at least $\frac{4}{3} \lg n - O(c \lg k)$ bits by a careful combination of
the above techniques. Recall that our starting point is three edges $a_0, a_1, a_2$ with $h_0(a_0) = h_0(a_1) =
h_0(a_2)$. The walk $x_1, \ldots, x_{k+1}$ started with $x_1 = a_0$ and finished when $\hat{x}_{k+1} = \emptyset$. We will now involve
the other starting edges $a_1$ and $a_2$. The analysis will split into many cases, each ended by a 'O'.

Case 1: One of $a_1$ and $a_2$ contains a free character. Let $j \in \{1, 2\}$ such that $a_j \not\subseteq x_{<k+1}$.
Let $y_1 = a_j$. We consider a walk $y_1, y_2, \ldots$ along the edges of the obstruction. Let $\hat{y}_i = y_i \setminus x_{<k+1} \setminus y_{<i}$
be the free characters of $y_i$ (which also takes all $x_i$'s into consideration). We stop the walk the first
time we observe $\hat{y}_{\ell+1} = \emptyset$. This must occur, since the graph is finite and there are no leaves (nodes
of degree one) in the obstruction. Thus, at the latest the walk stops when it repeats an edge.

We use the standard encoding for the second walk:

$\textsc{Id}(y_1)$; $\textsc{Coll}(a_0, y_1)$; $\textsc{Id}(y_2)$; $\textsc{Coll}(y_2, y_1)$;
$\ldots$; $\textsc{Id}(y_\ell)$; $\textsc{Coll}(y_{\ell-1}, y_\ell)$; $\textsc{Hashes}(h_\ell, y_\ell)$

Note that every pair $\textsc{Id}(y_j)$, $\textsc{Coll}(y_{j-1}, y_j)$ saves $\varepsilon$ bits, including the initial
$\textsc{Id}(y_1)$, $\textsc{Coll}(a_0, y_1)$. To end the walk, we can use one of the safe savings of Lemmas 13
and 14. These give a saving of $\frac{1}{3} \lg n - O(c \lg(\ell + k))$ bits, by modifying only $\textsc{Hashes}(h_\ell, y_\ell)$ or
$\textsc{Id}(y_\ell)$. These local changes cannot interfere with the first walk, so we can use any technique
(including piggybacking) to save $\lg n - O(c \lg k)$ bits from the first walk. We obtain a total saving
of $\frac{4}{3} \lg n - O(1)$, as required. O

We are left with the situation $a_1 \cup a_2 \subseteq x_{<k+1}$. This includes the case when $a_1$ and $a_2$ are actual
edges seen in the walk $x_1, \ldots, x_k$.

Let $t_j$ be the first time $a_j$ becomes known in the walk; that is, $a_j \not\subseteq x_{<t_j}$ but $a_j \subseteq x_{<t_j+1}$. By
symmetry, we can assume $t_1 \le t_2$. We begin with two simple cases.



Case 2: For some $j \in \{1, 2\}$, $t_j$ is even and $t_j < k$. We will apply Lemma 15 and save
$\lg n - O(c \lg k)$ bits by modifying $\textsc{Coll}(x_{t_j}, x_{t_j+1})$. Since $t_j < k$, this does not interact with safe
savings at the end of the stream, so we get a total saving of at least $\frac{4}{3} \lg n - O(c \lg k)$.

We apply Lemma 15 on the keys $e = a_0$ and $f = a_j$. We must first write $\textsc{Id}(a_j)$, which takes
$O(c \lg k)$ bits given $x_{<k+1}$. We have $a_0 \cup a_j \subseteq x_{<t_j+1}$ by definition of $t_j$. Since $a_j \cap \hat{x}_{t_j} \ne \emptyset$ and
$\hat{x}_{t_j+1} \cap (a_j \cup a_0) = \emptyset$, the lemma applies. O

Case 3: For some $j \in \{1, 2\}$, $t_j$ is odd and $a_j \setminus x_{<t_j-1} \ne \hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$. This assumption is
exactly what we need to apply Lemma 15 with $e = a_0$ and $f = a_j$. Note that $h_0(e) = h_0(f)$ and
$t_j$ is odd, so the lemma modifies $\textsc{Coll}(x_{t_j-1}, x_{t_j})$. The lemma can be applied in conjunction with
any safe saving, since the safe savings only require modifications to $\textsc{Id}(x_k)$ or $\textsc{Hashes}(h_k, x_k)$. O

We now deal with two cases when $t_1 = t_2$ (both being odd or even). These require a combination
of piggybacking followed by safe-weak savings. Note that in the odd case, we may assume
$a_1 \setminus x_{<t-1} = a_2 \setminus x_{<t-1} = \hat{x}_{t-1} \,\Delta\, \hat{x}_t$ (due to Case 3 above), and in the even case we may assume
$t_1 = t_2 = k$ (due to Case 2 above).

Case 4: $t_1 = t_2 = t$ is odd and $a_1 \setminus x_{<t-1} = a_2 \setminus x_{<t-1} = \hat{x}_{t-1} \,\Delta\, \hat{x}_t$. We first get $a_1$ and $a_2$
by piggybacking or odd-side saving. Let $i$ be the largest value such that $a_1 \setminus x_{<i} \ne a_2 \setminus x_{<i}$. Since
$a_1 \setminus x_{<t-1} = a_2 \setminus x_{<t-1}$, we have $i < t - 3$. The last key that piggybacking or odd-side saving can
interfere with is $x_{t-2}$.

We will now use the safe-weak saving of Lemma 14 to encode $\textsc{Id}(x_{t-1})$ and $\textsc{Id}(x_t)$. The known
characters are $K = x_{<t-1} \cup a_1 \cup a_2$, so $x_{t-1} \,\Delta\, x_t \subseteq K$. Lemma 14 codes both $\textsc{Id}(x_{t-1})$ and $\textsc{Id}(x_t)$
with $\frac{2}{3} \lg n + O(c \lg k)$ bits, which represents a saving of roughly $\frac{4}{3} \lg n$ over the original encoding
of the two identities. We don't need any more savings from the rest of the walk after $x_t$. O
Case 5: $t_1 = t_2 = k$ is even. Thus, $k$ is even and the last characters of $a_1$ and $a_2$ are only
revealed by $\hat{x}_k$.

Lemma 19. We can save $2 \lg n - O(c \lg k)$ bits by modifying $\textsc{Hashes}(h_k, x_k)$, unless both: (1)
$a_1 \cap \hat{x}_k = a_2 \cap \hat{x}_k$; and (2) $\hat{x}_k \setminus x_{k+1}$ is the empty set or equal to $a_1 \cap \hat{x}_k$.

Proof. The $h_0$ hash codes of the following 3 subkeys are known from the hash codes in $x_{<k}$: $a_1 \cap \hat{x}_k$,
$a_2 \cap \hat{x}_k$ (both because we know $h_0(a_0) = h_0(a_1) = h_0(a_2)$), and $\hat{x}_k \setminus x_{k+1}$ (since $x_k$ and $x_{k+1}$ collide).
If two of these subsets are distinct and nonempty, we can choose two characters $\alpha$ and $\beta$ from their
symmetric difference. We can encode all characters of $\hat{x}_k$ except for $\alpha$ and $\beta$, whose hash codes can
be deduced for free.

Since $a_j \cap \hat{x}_k \ne \emptyset$ in the current case, the situations when we can find two distinct nonempty sets
are: (1) $a_1 \cap \hat{x}_k \ne a_2 \cap \hat{x}_k$; or (2) $a_1 \cap \hat{x}_k = a_2 \cap \hat{x}_k$ but $\hat{x}_k \setminus x_{k+1}$ is nonempty and different from
them. □

From now on assume the lemma fails. We can still save $\lg n$ bits by modifying $\textsc{Hashes}(h_k, x_k)$.
We reveal all hash codes of $\hat{x}_k$, except for one position-character $\alpha \in a_1 \cap \hat{x}_k$. We then specify
$\textsc{Id}(a_1)$, which takes $O(c \lg k)$ bits. The hash $h_0(\alpha)$ can then be deduced from $h_0(a_1) = h_0(a_0)$.

We will now apply piggybacking or odd-side saving to $a_1$ and $a_2$. Let $i$ be the largest value with
$a_1 \setminus x_{<i} \ne a_2 \setminus x_{<i}$. Note that $a_1 \setminus x_{<k} = a_2 \setminus x_{<k}$, so $i < k - 1$. If $i$ is odd, Lemma 16 (odd-side
saving) can save $\lg n$ bits by modifying $\textsc{Coll}(x_{i+1}, x_{i+2})$; this works since $i + 2 < k$. If $i$ is even,
Lemma 17 (piggybacking) can give us $\textsc{Id}(a_1)$ and $\textsc{Id}(a_2)$ at a negligible cost of $O(c \lg k)$ bits. This
doesn't touch anything later than $\textsc{Id}(x_{i+1})$, where $i + 1 < k$.

When we arrive at $\textsc{Id}(x_k)$, we know the position characters $K = x_{<k} \cup a_1 \cup a_2$. This means
that $x_k \,\Delta\, x_{k+1} \subseteq K$, because $\hat{x}_k \setminus x_{k+1}$ is either empty or a subset of $a_1$. Therefore, we can use
weak-safe savings from Lemma 14 to code $\textsc{Id}(x_k)$ in just $\frac{2}{3} \lg n + O(c \lg k)$ bits. In total, we have
saved at least $\frac{4}{3} \lg n - O(c \lg k)$ bits. O

It remains to deal with distinct $t_1, t_2$, i.e. $t_1 < t_2 \le k$. If one of the numbers is even, it must
be $t_2 = k$, and then $t_1$ must be odd (due to Case 2). By Case 3, if $t_j$ is odd, we also know
$a_j \setminus x_{<t_j-1} = \hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$. Since these cases need to deal with at least one odd $t_j$, the following
lemma will be crucial:






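Lemma 20. If $t_j \le k$ is odd and $a_j \setminus x_{<t_j-1} = \hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$, we can code $\textsc{Id}(x_{t_j-1})$ and $\textsc{Id}(x_{t_j})$ with
$\frac{3}{2} \lg n + O(c \lg k)$ bits in total.

Proof. Consider the subkey $y = \hat{x}_{t_j-1} \setminus \hat{x}_{t_j}$. We first specify the positions of $y$ using $c$ bits. If
$C(y) \ge \sqrt{n}$, there are at most $\sqrt{n}$ possible choices of $y$, so we can specify $y$ with $\frac{1}{2} \lg n$ bits. We can
also identify $x_{t_j}$ with $\lg n$ bits. Then $\textsc{Id}(x_{t_j-1})$ requires $O(c \lg k)$ bits, since $x_{t_j-1} \subseteq y \cup x_{t_j} \cup x_{<t_j-1}$.

If $C(y) < \sqrt{n}$, we first specify $\textsc{Id}(x_{t_j-1})$ with $\lg n$ bits. This gives us the subkey $y \subseteq x_{t_j-1}$. Since
$a_j \setminus x_{<t_j-1} = \hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$, it follows that $y \subseteq a_j$. Thus, we can write $\textsc{Id}(a_j)$ using $\lg C(y) \le \frac{1}{2} \lg n$
bits. Since $x_{t_j} \subseteq x_{<t_j-1} \cup a_j$, we get $\textsc{Id}(x_{t_j})$ for an additional $O(c \lg k)$ bits. □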

Case 6: Both $t_1$ and $t_2$ are odd, $t_1 < t_2 < k$, and for all $j \in \{1, 2\}$, $a_j \setminus x_{<t_j-1} =
\hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$. We apply Lemma 20 for both $j = 1$ and $j = 2$, and save $\lg n$ bits in coding $\textsc{Id}(x_{t_1-1})$,
$\textsc{Id}(x_{t_1})$, $\textsc{Id}(x_{t_2-1})$, and $\textsc{Id}(x_{t_2})$. These are all distinct keys, because $t_1 < t_2$ and both are odd.
Since $t_2 < k$, we can combine this with any safe saving. O
Case 7: $t_2 = k$ is even and $t_1 < k$ is odd with $a_1 \setminus x_{<t_1-1} = \hat{x}_{t_1-1} \,\Delta\, \hat{x}_{t_1}$. We apply Lemma 20
for $j = 1$, and save $\frac{1}{2} \lg n - O(c \lg k)$ bits in coding $\textsc{Id}(x_{t_1-1})$, $\textsc{Id}(x_{t_1})$. We also save $\lg n$ bits
by modifying $\textsc{Hashes}(h_0, x_k)$. We reveal all hash codes of $\hat{x}_k$, except for one position-character
$\alpha \in a_2 \cap \hat{x}_k$ (which is a nonempty set since $t_2 = k$). We then specify $\textsc{Id}(a_2)$, which takes $O(c \lg k)$
bits. The hash $h_0(\alpha)$ can then be deduced from $h_0(a_2) = h_0(a_0)$. O
Case 8: Both $t_1$ and $t_2$ are odd, $t_1 < t_2 = k$, and for all $j \in \{1, 2\}$, $a_j \setminus x_{<t_j-1} =
\hat{x}_{t_j-1} \,\Delta\, \hat{x}_{t_j}$. To simplify notation, let $t_1 = t$. This case is the most difficult. If we can apply
strong-safe saving as in Lemma 13, we save $\lg n$ by modifying $\textsc{Hashes}(h_k, x_k)$. We also save $\lg n$ by
two applications of Lemma 20, coding $\textsc{Id}(x_{t-1})$, $\textsc{Id}(x_t)$, $\textsc{Id}(x_{k-1})$, and $\textsc{Id}(x_k)$. These don't interact
since $t < k$ and both are odd.

The strong-safe saving fails if $\hat{x}_k \subseteq x_{k+1}$. We will attempt to piggyback for $x_k$ and $x_{k+1}$. Let $i$
be the largest value such that $x_k \setminus x_{<i} \ne x_{k+1} \setminus x_{<i}$. If $i$ is even, we get an odd-side saving of $\lg n$
(Lemma 16). Since this does not affect any identities, we can still apply Lemma 20 to save $\frac{1}{2} \lg n$
on the identities $\textsc{Id}(x_{t-1})$ and $\textsc{Id}(x_t)$.

Now assume $i$ is odd. We have real piggybacking, which may affect the coding of $\textsc{Id}(x_i)$, $\textsc{Id}(x_{i+1})$
and $\textsc{Id}(x_k)$. Since both $i$ and $t$ are odd, there is at most one common key between $\{x_i, x_{i+1}\}$ and
$\{x_{t-1}, x_t\}$. We consider two cases:

• Suppose $x_{t-1} \notin \{x_i, x_{i+1}\}$. Let $y = \hat{x}_{t-1} \setminus \hat{x}_t$. After piggybacking, which in particular encodes
$x_t$, we can encode $\textsc{Id}(x_{t-1})$ in $\lg \frac{n}{C(y)} + O(c \lg k)$ bits. Indeed, we can write the positions of $y$
with $c$ bits and then the identity of $y$ using Huffman coding for all subkeys on those positions.
Finally the identity of $x_{t-1}$ can be written in $O(c \lg k)$ bits, since $x_{t-1} \subseteq x_{<t-1} \cup y \cup x_t$.

• Suppose $x_t \notin \{x_i, x_{i+1}\}$. Let $y = \hat{x}_t \setminus \hat{x}_{t-1}$. As above, we can write $\textsc{Id}(x_t)$ using $\lg \frac{n}{C(y)} +
O(c \lg k)$ bits, after piggybacking.

If $C(y) \ge n^{1/3}$, we have obtained a total saving of $\frac{4}{3} \lg n - O(c \lg k)$: a logarithmic term for
$\textsc{Id}(x_k)$ from piggybacking, and $\frac{1}{3} \lg n$ for $\textsc{Id}(x_{t-1})$ or $\textsc{Id}(x_t)$.

Now assume that $C(y) < n^{1/3}$. In this case, we do not use piggybacking. Instead, we use a
variation of Lemma 20 to encode $\textsc{Id}(x_{t-1})$ and $\textsc{Id}(x_t)$. First we code the one containing $y$ with
$\lg n$ bits. Since $a_1 \setminus x_{<t-1} = \hat{x}_{t-1} \,\Delta\, \hat{x}_t$, we have $y \subseteq a_1$. We code $\textsc{Id}(a_1)$ with
$\lg C(y) \le \frac{1}{3} \lg n$ bits. We obtain the other key among $x_{t-1}$ and $x_t$ using $O(c \lg k)$ bits, since all its
characters are known. Thus we have coded $\textsc{Id}(x_{t-1})$ and $\textsc{Id}(x_t)$ with $\frac{4}{3} \lg n + O(c \lg k)$ bits, for a
saving of roughly $\frac{2}{3} \lg n$ bits.

Next we consider the coding of $\textsc{Id}(x_{k-1})$ and $\textsc{Id}(x_k)$. We know that $a_2 \setminus x_{<k-1} = \hat{x}_{k-1} \,\Delta\, \hat{x}_k$
and $\hat{x}_k \subseteq x_{k+1}$. Lemma 20 would guarantee a saving of $\frac{1}{2} \lg n$ bits. However, we will perform an
analysis like above, obtaining a saving of $\frac{2}{3} \lg n$ bits.

Let $y = \hat{x}_{k-1} \setminus \hat{x}_k$. First assume $C(y) \ge n^{1/3}$. We use the safe-weak saving of Lemma 14 to
encode $\textsc{Id}(x_k)$ using $\frac{2}{3} \lg n$ bits. We then encode the subkey $y$ using $\lg \frac{n}{C(y)} + O(c) \le \frac{2}{3} \lg n + O(c)$ bits,
and finally $x_{k-1}$ using $O(c \lg k)$ bits. This obtains both $\textsc{Id}(x_{k-1})$ and $\textsc{Id}(x_k)$ using $\frac{4}{3} \lg n + O(c \lg k)$
bits.

Now assume $C(y) < n^{1/3}$. We first code $\textsc{Id}(x_{k-1})$ using $\lg n$ bits. This gives us $y$ for the price
of $c$ bits. But $a_2 \setminus x_{<k-1} = \hat{x}_{k-1} \,\Delta\, \hat{x}_k$, so $y \subseteq a_2$, and we can code $\textsc{Id}(a_2)$ using $\lg C(y) \le \frac{1}{3} \lg n$ bits.
Then $\textsc{Id}(x_k)$ can be coded with $O(c \lg k)$ bits. Again, we obtain both $\textsc{Id}(x_{k-1})$ and $\textsc{Id}(x_k)$ for the
price of $\frac{4}{3} \lg n + O(c \lg k)$ bits. O

This completes our analysis of cuckoo hashing. 



5 Minwise Independence 

We will prove that: 



The lower bound is relatively simple, and is shown in §5.1. The upper bound is significantly more
involved and appears in §5.2.

For the sake of the analysis, we divide the output range $[0, 1)$ into $\frac{n}{\ell}$ bins, where $\ell = \gamma \lg n$ for a
large enough constant $\gamma$. Of particular interest is the minimum bin $[0, \frac{\ell}{n})$. We choose $\gamma$ sufficiently
large for the Chernoff bounds of Theorem 1 to guarantee that the minimum bin is non-empty
w.h.p.: $\Pr[\min h(S) < \frac{\ell}{n}] \ge 1 - \frac{1}{n^2}$.



In §5.1 and §5.2 we assume that hash values $h(x)$ are binary fractions of infinite precision
(hence, we can ignore collisions). It is easy to see that (17) continues to hold when the hash codes
have (1 + -) lgn bits, even if ties are resolved adversarially. Let h be a truncation to (1 + =) lgn 
bits of the infinite-precision h. We only have a distinction between the two functions if q is the 
minimum and (3)x 6 S : h(x) = h(q). The probability of a distinction is bounded from above by: 

Pr[%)<^ A (3)x€S:h(x)=h(q)] < £• (n • ^) < 

We used 2-independence to conclude that $\{h(q) < \frac{\ell}{n}\}$ and $\{h(x) = h(q)\}$ are independent.

Both the lower and upper bounds start by expressing:

\[ \Pr[h(q) < \min h(S)] = \int_0^1 f(p)\,dp, \quad \text{where } f(p) = \Pr[p < \min h(S) \mid h(q) = p]. \]

For truly random hash functions, $\Pr[p < \min h(S) \mid h(q) = p] = (1 - p)^n$, since each element has
an independent probability of $1 - p$ of landing above $p$.
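
For the truly random baseline, the integral of $(1 - p)^n$ over $[0, 1]$ evaluates to $1/(n+1)$, which is the value the tabulation bounds are compared against. The sketch below checks this numerically with a midpoint rule; the function name `min_probability` and the step count are our choices.

```python
def min_probability(n: int, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of (1 - p)^n for p in [0, 1]."""
    width = 1.0 / steps
    return sum((1.0 - (i + 0.5) * width) ** n for i in range(steps)) * width


for n in (1, 10, 100):
    print(n, min_probability(n), 1.0 / (n + 1))
```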

5.1 Lower bound 

For a lower bound, it suffices to look at the case when $q$ lands in the minimum bin:

\[ \Pr[h(q) < \min h(S)] \ge \int_0^{\ell/n} f(p)\,dp, \quad \text{where } f(p) = \Pr[p < \min h(S) \mid h(q) = p]. \]






We will now aim to understand $f(p)$ for $p \in [0, \frac{\ell}{n}]$. In the analysis, we will fix the hash codes
of various position-characters in the order $\prec$ given by Lemma 7. Let $h(\prec \alpha)$ denote the choice for all
position-characters $\beta \prec \alpha$.

Remember that $\prec$ starts by fixing the characters of $q$ first, so: $q_1 \prec \cdots \prec q_c \prec \alpha_0 \prec \alpha_1 \prec \cdots$
Start by fixing $h(q_1), \ldots, h(q_c)$ subject to $h(q) = p$.

When it is time to fix some position-character $\alpha$, the hash code of any key $x \in G_\alpha$ is a
constant depending on $h(\prec \alpha)$ xor the random quantity $h(\alpha)$. This final xor makes $h(x)$ uniform
in $[0, 1)$. Thus, for any choice of $h(\prec \alpha)$, $\Pr[h(x) < p \mid h(\prec \alpha)] = p$. By the union bound,
$\Pr[p < \min h(G_\alpha) \mid h(\prec \alpha)] \ge 1 - p \cdot |G_\alpha|$. This implies that:

\[ f(p) = \Pr[p < \min h(S) \mid h(q) = p] \ge \prod_{\alpha \succ q_c} (1 - p \cdot |G_\alpha|). \qquad (18) \]

To bound this product from below, we use the following lemma: 

Lemma 21. Let $p \in [0, 1]$ and $k \ge 0$, where $p \cdot k \le \sqrt{2} - 1$. Then $1 - p \cdot k \ge (1 - p)^{(1 + pk)k}$.

Proof. First we note a simple proof for the weaker statement $1 - pk \ge (1 - p)^{\lceil (1 + pk)k \rceil}$. However,
it will be crucial for our later application of the lemma that we can avoid the ceiling.

Consider $t$ Bernoulli trials, each with failure probability $p$. The probability of no failures
occurring is $(1 - p)^t$. By the inclusion-exclusion principle, applied to the second level, this is
bounded from above by:

\[ (1 - p)^t \le 1 - t \cdot p + \binom{t}{2} p^2 \le 1 - \left(1 - \frac{pt}{2}\right) t \cdot p \]

Thus, $1 - kp$ can be bounded from below by the probability that no failure occurs among $t$ Bernoulli
trials with failure probability $p$, for $t$ satisfying $t \cdot (1 - \frac{pt}{2}) \ge k$. This holds for $t \ge (1 + kp)k$.

We have just shown $1 - p \cdot k \ge (1 - p)^{\lceil (1 + pk)k \rceil}$. Removing the ceiling requires an "inclusion-exclusion"
inequality with a non-integral number of experiments $t$. Such an inequality was shown
by Gerber [Ger68]: $(1 - \alpha)^t \le 1 - \alpha t + (\alpha t)^2 / 2$, even for fractional $t$. Setting $t = (1 + pk)k$, our
result is a corollary of Gerber's inequality:

\[ (1 - p)^t \le 1 - pt + \frac{(pt)^2}{2} = 1 - p(1 + pk)k + \frac{1}{2} \bigl( p(1 + pk)k \bigr)^2
 = 1 - pk - \left( 1 - \frac{(1 + pk)^2}{2} \right) (pk)^2 \le 1 - pk. \qquad □ \]
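
As a sanity check, the inequality of Lemma 21 can be tested numerically on a grid that includes fractional values of $k$. This is a sketch under our own choices (the grid, the tolerance `tol`, and the function names are hypothetical), and a numerical check of course does not replace the proof.

```python
import math
from typing import List, Tuple


def lemma21_holds(p: float, k: float, tol: float = 1e-12) -> bool:
    """Check 1 - p*k >= (1 - p)^((1 + p*k) * k), up to a small floating-point tolerance."""
    return 1.0 - p * k + tol >= (1.0 - p) ** ((1.0 + p * k) * k)


def violations_on_grid() -> List[Tuple[float, float]]:
    """Return all grid points with p*k <= sqrt(2) - 1 where the inequality fails."""
    limit = math.sqrt(2.0) - 1.0
    bad = []
    for i in range(1, 200):
        p = i / 200.0
        for j in range(1, 400):
            k = j / 4.0   # includes fractional k such as 0.25 and 1.75
            if p * k <= limit and not lemma21_holds(p, k):
                bad.append((p, k))
    return bad


print(violations_on_grid())
print(lemma21_holds(0.5, 2.0))   # p*k = 1 violates the precondition of the lemma
```

The second call shows that the precondition $p \cdot k \le \sqrt{2} - 1$ matters: for $p = 0.5$ and $k = 2$ the left side is $0$ while the right side is $0.5^4$.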

The lemma applies in our setting, since $p \le \frac{\ell}{n} = O(\frac{\lg n}{n})$ and all groups are bounded: $|G_\alpha| \le
2 \cdot n^{1 - 1/c}$. Note that $p \cdot |G_\alpha| \le \frac{\ell}{n} \cdot 2 n^{1 - 1/c} = O(\ell / n^{1/c})$. Plugging into (18):



a>-q c a>-qc 

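Let $m = n \cdot (1 + \ell / n^{1/c})$. The final result follows by integration over $p$:

\begin{align*}
\Pr[h(q) < \min h(S)] &\ge \int_0^{\ell/n} f(p)\,dp \ge \int_0^{\ell/n} (1 - p)^m\,dp
 = \left[ \frac{-(1 - p)^{m+1}}{m + 1} \right]_{p=0}^{\ell/n} = \frac{1 - (1 - \ell/n)^{m+1}}{m + 1} \\
&> \frac{1 - e^{-\ell}}{m + 1} > \frac{1 - 1/n}{n (1 + \ell / n^{1/c})} = \frac{1}{n} \cdot \left( 1 - \frac{O(\lg n)}{n^{1/c}} \right)
\end{align*}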

5.2 Upper bound

As in the lower bound, it will suffice to look at the case when $q$ lands in the minimum bin:

\[ \Pr[h(q) < h(S)] \le \Pr[\min h(S) \ge \tfrac{\ell}{n}] + \Pr[h(q) < h(S) \wedge h(q) < \tfrac{\ell}{n}] \le \frac{1}{n^2} + \int_0^{\ell/n} f(p)\,dp \]

To bound $f(p)$, we will fix position-characters in the order $\prec$ from Lemma 7, subject to $h(q) = p$.
In the lower bound, we could analyze the choice of $h(\alpha)$ even for the worst-case choice of $h(\prec \alpha)$.
Indeed, no matter how the keys in $G_\alpha$ arranged themselves, when shifted randomly by $h(\alpha)$, they
failed to land below $p$ with probability $1 - p|G_\alpha| \ge (1 - p)^{(1 + o(1)) |G_\alpha|}$.

For an upper bound, we need to prove that keys from $G_\alpha$ do land below $p$ often enough:
$\Pr[p < \min h(G_\alpha) \mid h(\prec \alpha)] \le (1 - p)^{(1 - o(1)) |G_\alpha|}$. However, a worst-case arrangement of $G_\alpha$ could
make all keys equal, which would give the terrible bound of just $1 - p$.

To refine the analysis, we can use Lemma 4, which says that for $d = O(1)$, all groups $G_\alpha$ are
$d$-bounded with probability $\ge 1 - \frac{1}{n^2}$. If $G_\alpha$ is $d$-bounded, its keys cannot cluster in less than
$\lceil |G_\alpha| / d \rceil$ different bins.

When a group $G_\alpha$ has more than one key in some bin, we pick one of them as a representative,
by some arbitrary (but fixed) tie-breaking rule. Let $R_\alpha$ be the set of representatives of $G_\alpha$. Observe
that the set $R_\alpha \subseteq G_\alpha$ is decided once we condition on $h(\prec \alpha)$. Indeed, the hash codes for keys in
$G_\alpha$ are decided up to a shift by $h(\alpha)$, and this common shift cannot change how keys cluster into
bins. We obtain:

\[ \Pr[p < \min h(G_\alpha) \mid h(\prec \alpha)] \le \Pr[p < \min h(R_\alpha) \mid h(\prec \alpha)] = 1 - p|R_\alpha| \le (1 - p)^{|R_\alpha|} \]

To conclude $\Pr[p < \min h(R_\alpha)] = 1 - p|R_\alpha|$ we used that the representatives are in different bins,
so at most one can land below $p$. Remember that $|R_\alpha|$ is a function of $h(\prec \alpha)$. By $d$-boundedness,
$|R_\alpha| \ge |G_\alpha| / d$, so we get $\Pr[p < \min h(G_\alpha) \mid h(\prec \alpha)] \le (1 - p)^{|G_\alpha| / d}$ for almost all $h(\prec \alpha)$.
Unfortunately, this is a far cry from the desired exponent, $|G_\alpha| \cdot (1 - O(n^{-1/c}))$.
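
The claim that a common shift cannot change how keys cluster into bins can be illustrated as follows. The sketch makes two assumptions that are ours, not the text's: hash codes are fixed-width integers, and a bin is identified by the top `bin_bits` bits of a code (so the number of bins is a power of two). The names `clusters` and `representatives` are hypothetical.

```python
from typing import Dict, List


def clusters(codes: List[int], shift: int, total_bits: int, bin_bits: int) -> List[List[int]]:
    """Partition key indices by the bin (top bin_bits bits) of their xor-shifted hash code."""
    groups: Dict[int, List[int]] = {}
    for idx, code in enumerate(codes):
        bin_id = (code ^ shift) >> (total_bits - bin_bits)
        groups.setdefault(bin_id, []).append(idx)
    return sorted(groups.values())


def representatives(codes: List[int], shift: int, total_bits: int, bin_bits: int) -> List[int]:
    """One representative per cluster, using the fixed rule 'smallest index wins'."""
    return [group[0] for group in clusters(codes, shift, total_bits, bin_bits)]


# Partial hash codes of the keys in one group, before the last character is fixed.
codes = [0b00010110, 0b00011111, 0b10100000, 0b10111111, 0b01000000]
base = clusters(codes, 0, 8, 3)
print(base)
print(all(clusters(codes, s, 8, 3) == base for s in range(256)))
```

Xoring every code with the same value permutes the bins but keeps two codes in the same bin exactly when their top bits agreed beforehand, so the set of representatives is fixed before the shift is chosen.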

To get a sharper bound, we will need a dynamic view of the representatives. After fixing
$h(\prec \alpha)$, we know whether two keys $x$ and $y$ collide whenever the symmetric difference $x \,\Delta\, y =
(x \setminus y) \cup (y \setminus x)$ consists only of position-characters $\prec \alpha$. Define $R_\beta(\alpha)$ to be our understanding
of the representatives $R_\beta$ just before character $\alpha$ is revealed: from any subset of $G_\beta$ that is known
to collide, we select only one key. After the query characters get revealed, we don't know of any
collisions yet (we know only one character per position), so $R_\beta(\alpha_0) = G_\beta$. The set of representatives
decreases in time, as we learn about more collisions, and $R_\beta(\beta) = R_\beta$ is the final value (revealing
$\beta$ doesn't change the clustering of $G_\beta$).

Let $C(\alpha)$ be the number of key pairs $(x, y)$ from the same group $G_\beta$ ($\beta \succ \alpha$) such that $\alpha =
\max_\prec (x \,\Delta\, y)$. These are the pairs whose collision is decided when $h(\alpha)$ is revealed, since $h(\alpha)$ is the
last unknown hash code in the keys, besides the common ones. Let $\alpha^+$ be the successor of $\alpha$ in the
order $\prec$. Consider the total number of representatives before and after $h(\alpha)$ is revealed: $\sum_\beta |R_\beta(\alpha)|$
versus $\sum_\beta |R_\beta(\alpha^+)|$. The maximum change between these quantities is $\le C(\alpha)$, while the expected
change is $\le C(\alpha) \cdot \frac{\ell}{n}$. This is because $h(\alpha)$ makes every pair $(x, y)$ collide with probability $\frac{\ell}{n}$,
regardless of the previous hash codes in $(x \,\Delta\, y) \setminus \{\alpha\}$. Note, however, that the number of colliding
pairs may overestimate the decrease in the representatives if the same key is in multiple pairs.

Let $n(\succ \alpha) = \sum_{\beta \succ \alpha} |G_\beta|$ and define $n(\succeq \alpha)$ similarly. Our main inductive claim is:

Lemma 22. For any setting $h(\prec \alpha)$ such that $h(q) = p$ and $\sum_{\beta \succeq \alpha} |R_\beta(\alpha)| = r$, we have:



Pr 



(p < min |^J h(G / 3) S j A (\/a)G a d-bounded | h(-<a) < P(a,p,r) 



where we define P(a,p,r) = (1 — p) r + (1 — p) r , 



tea ' 

As the definition $P(\alpha, p, r)$ may look intimidating, we first try to demystify it, while giving a
sketch for the lemma's proof (the formal proof appears in §5.3). The lemma looks at the worst-case
probability, over prior choices $h(\prec \alpha)$, that $p = h(q)$ remains the minimum among groups
$G_\alpha, G_{\alpha^+}, \ldots$. After seeing the prior hash codes, the number of representatives in these groups
is $r = \sum_{\beta \succeq \alpha} |R_\beta(\alpha)|$. In the ideal case when $h(\alpha), h(\alpha^+), \ldots$ do not introduce any additional
collisions, we have $r$ representatives that could beat $p$ for the minimum. As argued above, the
probability that $p$ is smaller than all these representatives is $\le (1 - p)^r$. Thus, the first term of
$P(\alpha, p, r)$ accounts for the ideal case when no more collisions occur.

On the other hand, the factor $(1 - p)^{n(\succeq \alpha)/(2d)}$ accounts for the worst case, with no guarantee
on the representatives except that the groups are $d$-bounded (the 2 in the exponent is an artifact).
Thus, $P(\alpha, p, r)$ interpolates between the best case and the worst case. This is explained by a
convexity argument: the bound is maximized when $h(\alpha)$ mixes among two extreme strategies: it
creates no more collisions, or creates the maximum it could.

It remains to understand the weight attached to the worst-case probability. After fixing h(a), 
the maximum number of remaining representatives is f = Y^Bya l-^/3( a )l- The expected number is 
> r — C(a)^, since every collision happens with probability ^. By a Markov bound, the worst case 
(killing most representatives) can only happen with probability 0(j^C{a)/r). The weight of the 
worst case follows by r > n{>~a)/d and letting these terms accrue in the induction for f3 >~ a. 

Deriving the upper bound. We now prove the upper bound on $\Pr[h(q) < h(S)]$ assuming Lemma 22. Let $\alpha_0$ be the first position-character fixed after the query. Since fixing the query cannot eliminate representatives,

$$\Pr[\,p < \min h(S) \;\wedge\; (\forall\alpha)\,G_\alpha \text{ $d$-bounded} \mid h(q) = p\,] \le P(\alpha_0,p,n).$$

Lemma 23. $P(\alpha_0,p,n) \le (1-p)^n + (1-p)^{n/(2d)} \cdot O\!\left(\frac{\lg^2 n}{n^{1/c}}\right)$.

Proof. We will prove that $A = \sum_{\beta\succeq\alpha_0} \frac{C(\beta)}{n(\succeq\beta)} \le n^{1-1/c}\cdot H_n$, where $H_n$ is the Harmonic number.

Consider all pairs $(x,y)$ from the same group $G_\gamma$, and order them by $\beta = \max_\prec(x \Delta y)$. This is the time when the pair gets counted in some $C(\beta)$ as a potential collision. The contribution of the pair to the sum is $1/n(\succeq\beta)$, so this contribution is maximized if $\beta$ immediately precedes $\gamma$ in the order $\prec$. That is, the sum is maximized when $C(\beta) = \binom{|G_{\beta^+}|}{2}$. We obtain $A \le \sum_\beta \binom{|G_\beta|}{2}/n(\succeq\beta) \le n^{1-1/c}\cdot\sum_\beta |G_\beta|/n(\succeq\beta)$. In this sum, each key $x \in G_\beta$ contributes $1/n(\succeq\beta)$, which is bounded by one over the number of keys following $x$. Thus $A \le n^{1-1/c}\cdot H_n$. □

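To achieve our original goal, bounding $\Pr[h(q) < h(S)]$, we proceed as follows:

$$\Pr[h(q) < h(S)] \le \frac{1}{n^2} + \int_0^{\ell/n} \Pr[\,p < \min h(S) \mid h(q) = p\,]\,dp$$
$$\le \frac{1}{n^2} + \Pr[(\exists\alpha)\,G_\alpha \text{ not $d$-bounded}] + \int_0^{\ell/n} P(\alpha_0,p,n)\,dp.$$

By Lemma 4, all groups are $d$-bounded with probability $1 - \frac{1}{n^2}$. We also have

$$\int_0^{\ell/n} (1-p)^n\,dp = \left[\frac{-(1-p)^{n+1}}{n+1}\right]_{p=0}^{\ell/n} \le \frac{1}{n+1}.$$

Thus:

$$\Pr[h(q) < h(S)] \le O\!\left(\frac{1}{n^2}\right) + \frac{1}{n+1} + \frac{1}{n/(2d)+1}\cdot O\!\left(\frac{\lg^2 n}{n^{1/c}}\right) = \frac{1}{n}\left(1 + O\!\left(\frac{\lg^2 n}{n^{1/c}}\right)\right).$$
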
5.3 Proof of Lemma 22

Recall that we are fixing some choice of $h(\prec\alpha)$ and bounding:

$$A = \Pr\Big[\,p < \min \bigcup_{\beta\succeq\alpha} h(G_\beta) \;\wedge\; (\forall\alpha)\,G_\alpha \text{ $d$-bounded} \;\Big|\; h(\prec\alpha)\Big].$$

If for some $\beta$, $|R_\beta(\alpha)| < |G_\beta|/d$, it means not all groups are $d$-bounded, so $A = 0$. If all groups are $d$-bounded and we finished fixing all position-characters, $A = 1$. These form the base cases of our induction.

The remainder of the proof is the inductive step. We first break the probability into:

$$A_1 \cdot A_2 = \Pr[\,p < \min h(G_\alpha) \mid h(\prec\alpha)\,] \cdot \Pr\Big[\,p < \min \bigcup_{\beta\succ\alpha} h(G_\beta) \;\wedge\; (\forall\alpha)\,G_\alpha \text{ $d$-bounded} \;\Big|\; h(\prec\alpha),\; p < \min h(G_\alpha)\Big].$$

As $h(\alpha)$ is uniformly random, each representative in $R_\alpha$ has a probability of $p$ of landing below $p$. These events are disjoint because $p$ is in the minimum bin, so $A_1 = 1 - p\cdot|R_\alpha| \le (1-p)^{|R_\alpha|}$.

After using $R_\alpha$, we are left with $\tilde r = r - |R_\alpha| = \sum_{\beta\succ\alpha}|R_\beta(\alpha)|$ representatives. After $h(\alpha)$ is chosen, some of the $\tilde r$ representatives are lost. Define the random variable $\Delta = \sum_{\beta\succ\alpha}\big(|R_\beta(\alpha)| - |R_\beta(\alpha^+)|\big)$ to measure this loss.

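Let $\Delta_{\max} \ge \tilde r - \frac{n(\succ\alpha)}{d}$ be a value to be determined. We only need to consider $\Delta \le \Delta_{\max}$. Indeed, if more than $\Delta_{\max}$ representatives are lost, we are left with less than $n(\succ\alpha)/d$ representatives, so some group is not $d$-bounded, and the probability is zero. We can now bound $A_2$ by the induction hypothesis:

$$A_2 \le \sum_{\delta=0}^{\Delta_{\max}} \Pr[\,\Delta = \delta \mid h(\prec\alpha),\, p < \min h(G_\alpha)\,] \cdot P(\alpha^+, p, \tilde r - \delta),$$

where we had $P(\alpha^+, p, \tilde r - \delta) = (1-p)^{\tilde r-\delta} + (1-p)^{n(\succ\alpha)/(2d)} \cdot \sum_{\beta\succ\alpha} \frac{4C(\beta)\cdot(\ell/n)}{n(\succeq\beta)/d}$.

Observe that the second term of $P(\alpha^+, p, \tilde r - \delta)$ does not depend on $\delta$, so:

$$A_2 \le A_3 + (1-p)^{n(\succ\alpha)/(2d)} \cdot \sum_{\beta\succ\alpha} \frac{4C(\beta)\cdot(\ell/n)}{n(\succeq\beta)/d},$$

where $A_3 = \sum_{\delta=0}^{\Delta_{\max}} \Pr[\,\Delta = \delta \mid h(\prec\alpha),\, p < \min h(G_\alpha)\,] \cdot (1-p)^{\tilde r-\delta}$.

It remains to bound $A_3$. We observe that $(1-p)^{\tilde r-\delta}$ is convex in $\delta$, so it achieves the maximum value if all the probability mass of $\Delta$ is on $0$ and $\Delta_{\max}$, subject to preserving the mean.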

Observation 24. We have: $E[\,\Delta \mid h(\prec\alpha),\, p < \min h(G_\alpha)\,] \le 2\cdot C(\alpha)\cdot\frac{\ell}{n}$.

Proof. As discussed earlier, a representative disappears when we have a pair $x, y \in R_\beta(\alpha)$ that lands in the same bin due to $h(\alpha)$. This can only happen if $(x,y)$ is counted in $C(\alpha)$, i.e. $\alpha = \max_\prec(x \Delta y)$. If $h(\alpha)$ is uniform, such a pair $(x,y)$ collides with probability $\frac{\ell}{n}$, regardless of $h(\prec\alpha)$. By linearity of expectation, $E[\Delta \mid h(\prec\alpha)] \le C(\alpha)\cdot\frac{\ell}{n}$.

However, we have to condition on the event $p < \min h(G_\alpha)$, which makes $h(\alpha)$ non-uniform. Since $p \le \frac{\ell}{n}$ and $|G_\alpha| \le n^{1-1/c}$, we have $\Pr[p < \min h(G_\alpha)] \ge 1/2$. Therefore, conditioning on this event can at most double the expectation of positive random variables. □

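A bound on $A_3$ can be obtained by assuming $\Pr[\Delta = \Delta_{\max}] = \big(2\cdot C(\alpha)\cdot\frac{\ell}{n}\big)/\Delta_{\max}$, and all the rest of the mass is on $\Delta = 0$. This gives:

$$A_3 \le (1-p)^{\tilde r} + \frac{2\cdot C(\alpha)\cdot(\ell/n)}{\Delta_{\max}} \cdot (1-p)^{\tilde r - \Delta_{\max}}.$$

Remember that we promised to choose $\Delta_{\max} \ge \tilde r - \frac{n(\succ\alpha)}{d}$. We now fix $\Delta_{\max} = \tilde r - \frac{n(\succ\alpha)}{2d}$. We are guaranteed that $\tilde r \ge \frac{n(\succ\alpha)}{d}$, since otherwise some group is not $d$-bounded. This means $\Delta_{\max} \ge \frac{n(\succ\alpha)}{2d}$. We have obtained a bound on $A_3$:

$$A_3 \le (1-p)^{\tilde r} + \frac{2\cdot C(\alpha)\cdot(\ell/n)}{n(\succ\alpha)/(2d)} \cdot (1-p)^{n(\succ\alpha)/(2d)}$$

$$\Longrightarrow\quad A \le (1-p)^{|R_\alpha|}\cdot(1-p)^{r-|R_\alpha|} + (1-p)^{n(\succeq\alpha)/(2d)} \cdot \sum_{\beta\succeq\alpha} \frac{4C(\beta)\cdot(\ell/n)}{n(\succeq\beta)/d}.$$

This completes the proof of Lemma 22, and the bound on minwise independence.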



6 Fourth Moment Bounds 

Consider distributing a set $S$ of $n$ balls into $m$ bins truly randomly. For the sake of generality, let each element have a weight of $w_i$. We designate a query ball $q \notin S$, and let $W$ be the total weight of the elements landing in bin $F(h(q))$, where $F$ is an arbitrary function. With $\mu = E[W] = \frac{1}{m}\sum_i w_i$, we are interested in the 4th moment of the bin size: $E[(W-\mu)^4]$.

Let $X_i$ be the indicator that ball $i \in S$ lands in bin $F(h(q))$, and let $Y_i = X_i - \frac{1}{m}$. We can rewrite $W - \mu = \sum_i Y_i w_i$, so:

$$E[(W-\mu)^4] = \sum_{i,j,k,l\in S} w_i w_j w_k w_l \cdot E[Y_i Y_j Y_k Y_l]. \qquad (19)$$

The terms in which some element appears exactly once are zero. Indeed, if $i \notin \{j,k,l\}$, then $E[Y_i Y_j Y_k Y_l] = E[Y_i]\cdot E[Y_j Y_k Y_l]$, which is zero since $E[Y_i] = 0$. Thus, the only nonzero terms arise from:

• four copies of one element ($i = j = k = l$), giving the term $\big(\frac{1}{m} \pm O(\frac{1}{m^2})\big)w_i^4$.

• two distinct elements $s \ne t$, each appearing twice. There are $\binom{4}{2} = 6$ terms for each $s, t$ pair, and each term is $O(\frac{1}{m^2})\,w_s^2 w_t^2$.

This gives the standard 4th moment bound:

$$E[(W-\mu)^4] = \frac{1}{m}\sum_i w_i^4 + \frac{O(1)}{m^2}\Big(\sum_i w_i^2\Big)^2. \qquad (20)$$
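
As a sanity check on the term counting behind (20), the following sketch works out the unweighted, fully random, fixed-bin case exactly. It assumes all $w_i = 1$, so that $W$ is binomial with parameters $n$ and $1/m$; the function names are illustrative and not from the paper. The first function sums over the distribution of $W$; the second adds the $n$ "four copies" terms and the $3n(n-1)$ "two elements, each twice" terms (6 arrangements for each of the $n(n-1)/2$ unordered pairs).

```python
from fractions import Fraction
from math import comb


def fourth_moment_by_pmf(n, m):
    """E[(W - mu)^4] for W ~ Binomial(n, 1/m), summed over the exact distribution."""
    p = Fraction(1, m)
    mu = n * p
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * (k - mu) ** 4
        for k in range(n + 1)
    )


def fourth_moment_by_terms(n, m):
    """The same moment, assembled from the nonzero terms of the 4-fold sum (19)."""
    p = Fraction(1, m)
    ey2 = p * (1 - p) ** 2 + (1 - p) * p**2  # E[Y^2] for Y = X - 1/m
    ey4 = p * (1 - p) ** 4 + (1 - p) * p**4  # E[Y^4]
    # n terms with i=j=k=l, plus 6 arrangements for each of n(n-1)/2 pairs.
    return n * ey4 + 3 * n * (n - 1) * ey2**2


if __name__ == "__main__":
    n, m = 10, 4
    print(fourth_moment_by_pmf(n, m), fourth_moment_by_terms(n, m))
```

Both functions use exact rational arithmetic, so they can be compared with `==`; each value is also at most $n/m + 3n^2/m^2$, matching the shape of (20).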

This bound holds even if balls are distributed by 5-independent hashing: the balls in any 4-tuple hit the bin chosen by $h(q)$ independently at random. On the other hand, with 4-independent hashing, this bound can fail quite badly [PT10].

If the distribution of balls into bins is achieved by simple tabulation, we will show a slightly weaker version of (20):

$$E[(W-\mu)^4] = \frac{1}{m}\sum_i w_i^4 + O\!\left(\frac{1}{m^2} + \frac{4^c}{m^3}\right)\cdot\Big(\sum_i w_i^2\Big)^2. \qquad (21)$$

In §6.1, we show how to analyze the 4th moment of a fixed bin (which requires 4-independence by standard techniques). Our proof is a combinatorial reduction to Cauchy-Schwarz. In §6.2, we let the bin depend on the hash code $h(q)$. This requires 5-independence by standard techniques. To handle tabulation hashing, §6.3 shows a surprising result: among any 5 keys, at least one hashes independently of the rest.

We note that the bound on the 4th moment of a fixed bin has been independently discovered by [BCL+10] in a different context. However, that work is not concerned with a query-dependent bin, which is the most surprising part of our proof.

6.1 Fourth Moment of a Fixed Bin



We now attempt to bound the terms of (19) in the case of simple tabulation. Since simple tabulation is 3-independent [WC81], any terms that involve only 3 distinct keys (i.e. $|\{i,j,k,l\}| \le 3$) have the same expected value as established above. Thus, we can bound:

$$E[(W-\mu)^4] = \frac{1}{m}\sum_i w_i^4 + \frac{O(1)}{m^2}\Big(\sum_i w_i^2\Big)^2 + \sum_{i\ne j\ne k\ne l} w_i w_j w_k w_l \cdot E[Y_i Y_j Y_k Y_l].$$

Unlike the case of 4-independence, the contribution from distinct $i, j, k, l$ will not be zero. We begin with the following simple bound on each term:

Claim 25. For distinct $i,j,k,l$, $E[Y_i Y_j Y_k Y_l] = O(\frac{1}{m^3})$.

Proof. We are looking at the expectation of $Z = (X_i - \frac{1}{m})(X_j - \frac{1}{m})(X_k - \frac{1}{m})(X_l - \frac{1}{m})$. Note that $Z$ is only positive when an even number of the four $X$'s are 1:

1. the case $X_i = X_j = X_k = X_l = 1$ only happens with probability at most $\frac{1}{m^3}$ by 3-independence. The contribution to $Z$ is $(1 - \frac{1}{m})^4 < 1$.

2. the case of two 1's and two 0's happens with probability at most $\binom{4}{2}\frac{1}{m^2}$, and contributes $\frac{1}{m^2}(1-\frac{1}{m})^2 < \frac{1}{m^2}$ to $Z$.

3. the case of $X_i = X_j = X_k = X_l = 0$ contributes $\frac{1}{m^4}$ to $Z$.

Thus, the first case dominates and $E[Z] = O(\frac{1}{m^3})$. □

If one of $\{i,j,k,l\}$ contains a unique position-character, its hash code is independent of the other three. In this case, the term is zero, as the independent key factors out of the expectation and $E[Y_i] = 0$. We are left with analyzing 4-tuples with no unique position-characters; let $A \subseteq S^4$ contain all such 4-tuples. Then:

$$\sum_{i\ne j\ne k\ne l} w_i w_j w_k w_l \cdot E[Y_i Y_j Y_k Y_l] = O\!\left(\frac{1}{m^3}\right)\cdot \sum_{(i,j,k,l)\in A} w_i w_j w_k w_l.$$

Imagine representing a tuple from $A$ as a $4 \times c$ matrix, with every key represented in a row. There are four types of columns that we may see: columns that contain a single character in all rows (type 1), and columns that contain two distinct characters, each appearing in two rows (type $j \in \{2,3,4\}$ means that row $j$ has the same character as row 1). According to this classification, there are $4^c$ possible matrix types.

Claim 26. Fix a matrix type, and let $B \subseteq A$ contain all tuples conforming to this type. Then, $\sum_{(i,j,k,l)\in B} w_i w_j w_k w_l \le \big(\sum_i w_i^2\big)^2$.

Proof. We first group keys according to their projection on the type-1 characters. We obtain a partition of the keys $S = S_1 \cup S_2 \cup \cdots$ such that $S_t$ contains keys that are identical in the type-1 coordinates. Tuples that conform to the fixed matrix type, $(i,j,k,l) \in B$, must consist of four keys from the same set, i.e. $i,j,k,l \in S_t$. Below, we analyze each $S_t$ separately and bound the tuples from $(S_t)^4$ by $\big(\sum_{i\in S_t} w_i^2\big)^2$. This implies the claim by convexity, as $\sum_t \big(\sum_{i\in S_t} w_i^2\big)^2 \le \big(\sum_i w_i^2\big)^2$.

For the remainder, fix some $S_t$. If $|S_t| < 4$, there is nothing to prove. Otherwise, there must exist at least one character of type different from 1, differentiating the keys. By permuting the set $\{i,j,k,l\}$, we may assume a type-2 character exists. Group keys according to their projection on all type-2 characters. We obtain a partition of the keys $S_t = T_1 \cup T_2 \cup \cdots$ such that $T_a$ contains keys that are identical in the type-2 coordinates.

A type-conforming tuple $(i,j,k,l) \in B$ must satisfy $i,j \in T_a$ and $k,l \in T_b$ for $a \ne b$. We claim a stronger property: for any $i,j \in T_a$ and every $b \ne a$, there exists at most one pair $k,l \in T_b$ completing a valid tuple $(i,j,k,l) \in B$. Indeed, for type-1 coordinates, $k$ and $l$ must be identical to $i$ on that coordinate. For type 3 and 4 coordinates, $k$ and $l$ must reuse the characters from $i$ and $j$ ($k \leftarrow i$, $l \leftarrow j$ for type 3; $k \leftarrow j$, $l \leftarrow i$ for type 4).

Let $X \subseteq (T_a)^2$ contain the pairs $i,j \in T_a$ which can be completed by one pair $k,l \in T_b$. Let $Y \subseteq (T_b)^2$ contain the pairs $k,l \in T_b$ which can be completed by $i,j \in T_a$. There is a bijection between $X$ and $Y$; let it be $f : X \to Y$. We can now apply the Cauchy-Schwarz inequality:

$$\sum_{(i,j,k,l)\in B\cap(T_a\times T_b)} w_i w_j w_k w_l = \sum_{(i,j)\in X,\ (k,l)=f(i,j)} (w_i w_j)\cdot(w_k w_l) \le \sqrt{\Big(\sum_{(i,j)\in X} (w_i w_j)^2\Big)\Big(\sum_{(k,l)\in Y} (w_k w_l)^2\Big)}.$$

But $\sum_{(i,j)\in X} w_i^2 w_j^2 \le \big(\sum_{i\in T_a} w_i^2\big)^2$. Thus, the equation is further bounded by $\big(\sum_{i\in T_a} w_i^2\big)\big(\sum_{k\in T_b} w_k^2\big)$.

Summing up over all $T_a$ and $T_b$, we obtain:

$$\sum_{(i,j,k,l)\in B\cap(S_t)^4} w_i w_j w_k w_l \le \sum_{a,b}\Big(\sum_{i\in T_a} w_i^2\Big)\Big(\sum_{k\in T_b} w_k^2\Big) \le \Big(\sum_{i\in S_t} w_i^2\Big)^2.$$

This completes the proof of the claim. □

The bound of Claim 26 is multiplied by $4^c$, the number of matrix types. We have thus shown (21).



6.2 Fourth Moment of a Query-Dependent Bin 

We now aim to bound the 4th moment of a bin chosen as a function $F$ of $h(q)$, where $q$ is a designated query ball. This requires dealing with 5 keys ($i,j,k,l$ and the query $q$). Even though simple tabulation is only 3-independent, we will prove the following intriguing independence guarantee in §6.3:

Theorem 27. With simple tabulation, in any fixed set of 5 distinct keys, there is a key whose hash is independent of the other 4 hash codes.

As a side note, we observe that this theorem essentially implies that any 4-independent tabulation based scheme is also 5-independent. In particular, this immediately shows the 5-independence of the scheme from [TZ04] (which augments simple tabulation with some derived characters). This fact was already known [TZ09], albeit with a more complicated proof.

In the remainder of this section, we use Theorem 27 to derive the 4th moment bound (21). As before, we want to bound terms $w_i w_j w_k w_l \cdot E[Y_i Y_j Y_k Y_l]$ for all possible configurations of $(i,j,k,l)$. Remember that $q \notin S$, so $q \notin \{i,j,k,l\}$. These terms can fall in one of the following cases:

• All keys are distinct, and $q$ hashes independently. Then, the contribution of $i,j,k,l$ to bin $F(h(q))$ is the same as to any fixed bin.

• All keys are distinct, and $q$ is dependent. Then, at least one of $\{i,j,k,l\}$ must be independent of the rest and $q$; say it is $i$. But then we can factor $i$ out of the product: $E[Y_i Y_j Y_k Y_l] = E[Y_i]\cdot E[Y_j Y_k Y_l]$. The term is thus zero, since $E[Y_i] = 0$.

• Three distinct keys, $|\{i,j,k,l\}| = 3$. This case is analyzed below.

• One or two distinct keys: $|\{i,j,k,l\}| \le 2$. By 3-independence of simple tabulation, all hash codes are independent, so the contribution of this term is the same as in the case of a fixed bin.

To summarize, the 4th moment of bin $F(h(q))$ is the same as the 4th moment of a fixed bin, plus an additional term due to the case $|\{i,j,k,l\}| = 3$. The remaining challenge is to understand terms of the form $w_i^2 w_j w_k\, E[Y_i^2 Y_j Y_k]$. We first prove the following, which is similar to Claim 25:

Claim 28. For distinct $i,j,k$, $E[Y_i^2 Y_j Y_k] = O(\frac{1}{m^2})$.

Proof. By 3-independence of simple tabulation, $Y_i$ and $Y_j$ are independent (these involve looking at the hashes of $i,j,q$). For an upper bound, we can ignore all outcomes $Y_i^2 Y_j Y_k < 0$, i.e. when $Y_j$ and $Y_k$ have different signs. On the one hand, $Y_j = Y_k = 1 - \frac{1}{m}$ with probability $O(\frac{1}{m^2})$. On the other hand, if $Y_j = Y_k = -\frac{1}{m}$, the contribution to the expectation is $O(\frac{1}{m^2})$. □

Assume $w_j \ge w_k$ by symmetry. If $k$ hashes independently of $\{i,j,q\}$, the term is zero, since $E[Y_k] = 0$ can be factored out. Otherwise, the term contributes $O(w_i^2 w_j^2/m^2)$ to the sum.

Claim 29. For any distinct $i,j,q$, there is a unique key $k$ such that $h(k)$ depends on $h(i), h(j), h(q)$.

Proof. We claim that if any of $\{i,j,k,q\}$ has a unique position-character, all keys are independent. Indeed, the key with a unique position-character is independent of the rest, which are independent among themselves by 3-independence.

Thus, any set $\{i,j,q\}$ that allows for a dependent $k$ cannot have 3 distinct position-characters on one position. In any position where $i$, $j$, and $q$ coincide, $k$ must also share that position-character. If $i$, $j$, and $q$ contain two distinct characters on some position, $k$ must contain the one that appears once. This determines $k$. □
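
The construction in the proof of Claim 29 can be sketched directly. The sketch below assumes keys are represented as equal-length tuples of characters (an illustrative encoding, not the paper's), and the function name is hypothetical. It returns the single candidate $k$ described in the proof, or `None` when some position holds three distinct characters, in which case no dependent key exists.

```python
def dependent_fourth_key(i, j, q):
    """Return the only key k whose hash can depend on h(i), h(j), h(q).

    Keys are equal-length tuples of characters. Returns None when some
    position holds three distinct characters (no dependent key exists).
    """
    k = []
    for a, b, c in zip(i, j, q):
        if a == b == c:
            k.append(a)      # all three coincide: k shares the character
        elif a == b:
            k.append(c)      # c appears once: k must repeat it
        elif a == c:
            k.append(b)
        elif b == c:
            k.append(a)
        else:
            return None      # three distinct characters on this position
    return tuple(k)


if __name__ == "__main__":
    print(dependent_fourth_key((0, 5), (0, 7), (1, 5)))
```

With this choice of $k$, every position-character appears an even number of times among $\{i,j,q,k\}$, so the xor of the four hash codes is zero and $h(k)$ is determined by the other three.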

For any $i$ and $j$, we see exactly one set $\{i,j,k\}$ that leads to bad tuples. By an infinitesimal perturbation of the weights, each such set leads to $\binom{4}{2} = 6$ tuples: we have to choose two positions for $i$, and then $j$ is the remaining key with larger weight. Thus, the total contribution of all terms $(i,j,k,l)$ with 3 distinct keys is $O\big(\sum_{i,j} \frac{w_i^2 w_j^2}{m^2}\big) = O(\frac{1}{m^2})\cdot\big(\sum_i w_i^2\big)^2$. This completes the proof of (21).

6.3 Independence Among Five Keys 

The section is dedicated to proving Theorem 27. We first observe the following immediate fact:

Fact 30. If, restricting to a subset of the characters (matrix columns), a key $x \in X$ hashes independently from $X \setminus \{x\}$, then it also hashes independently when considering all characters.

If some key contains a unique character, we are done by peeling. Otherwise, each column contains either a single value in all five rows, or two distinct values: one appearing in two rows, and one in three rows. By Fact 30, we may ignore the columns containing a single value. For the columns containing two values, relabel the value appearing three times with 0, and the one appearing twice with 1. By Fact 30 again, we may discard any duplicate column, leaving at most $\binom{5}{2}$ distinct columns.

Since the columns have weight 2, the Hamming distance between two columns is either 2 or 4.

Lemma 31. If two columns have Hamming distance 4, one hash value is independent.



Proof. By Fact 30, we ignore all other columns. Up to reordering of the rows, the matrix is:

$$\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$$

By 3-independence of character hashing, keys 1, 3, and 5 are independent. But keys 2 and 4 are identical to keys 1 and 3. Thus, key 5 is independent from the rest. □

We are left with the case where all column pairs have Hamming distance 2. By reordering of the rows, the two columns look like the matrix in (a) below. Then, there exist only two column vectors that are at distance two from both of the columns in (a):

$$\text{(a)}\ \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \qquad \text{(b)}\ \begin{pmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{pmatrix} \qquad \text{(c)}\ \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}$$

If the matrix does not contain column (b), then keys 4 and 5 are identical, a contradiction. Thus, the matrix must contain columns (a) and (b), with (c) being optional. If (c) appears, discard it by Fact 30. We are left with the matrix:

$$\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$$

Now, observe that the hash code of row 1 is just the xor of the hash codes of rows 2-4, $h(1) = h(2) \oplus h(3) \oplus h(4)$. Indeed, with the labeling above, the codes of the zero characters in each column cancel out among rows 2-4, leaving us with an xor of the one characters in each column. We claim that row 5 is independent of rows 2-4. This immediately implies that row 5 is independent of all others, since row 1 is just a function of rows 2-4.

Independence of row 5 from rows 2-4 follows by peeling. Each of rows 2, 3, and 4 has a position character not present in 5, so they are independent of 5. This completes the proof of Theorem 27.
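
Theorem 27 can be checked exhaustively on small key universes. The sketch below relies on a standard linear-algebra view that is not spelled out in the text: each key is encoded as a 0/1 vector over position-characters, each hash code is the xor of the table entries selected by that vector, and with independent uniform table entries $h(x)$ is independent of the other hash codes exactly when the vector of $x$ is not an xor of a subset of the other vectors. All function names and the bitmask encoding are illustrative assumptions.

```python
from itertools import combinations, product


def key_vector(key, alphabet_size):
    """Bitmask with one bit per position-character used by the key."""
    v = 0
    for pos, ch in enumerate(key):
        v |= 1 << (pos * alphabet_size + ch)
    return v


def in_span(target, vectors):
    """True if target is an xor of a subset of vectors (Gaussian elimination over GF(2))."""
    basis = {}  # highest set bit -> reduced basis vector
    for v in vectors:
        while v:
            hb = v.bit_length() - 1
            if hb not in basis:
                basis[hb] = v
                break
            v ^= basis[hb]
    while target:
        hb = target.bit_length() - 1
        if hb not in basis:
            return False
        target ^= basis[hb]
    return True


def has_independent_key(keys, alphabet_size):
    """True if some key's vector lies outside the span of the others."""
    vecs = [key_vector(k, alphabet_size) for k in keys]
    return any(
        not in_span(vecs[i], vecs[:i] + vecs[i + 1:]) for i in range(len(vecs))
    )


def check_all_five_sets(c, alphabet_size):
    """Check the statement of Theorem 27 for every 5-subset of a small universe."""
    universe = list(product(range(alphabet_size), repeat=c))
    return all(
        has_independent_key(s, alphabet_size) for s in combinations(universe, 5)
    )


if __name__ == "__main__":
    print(check_all_five_sets(c=4, alphabet_size=2))
```

The same test fails for 4 keys: the set $\{00, 01, 10, 11\}$ has no independent key, which is the usual witness that simple tabulation is not 4-independent.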



6.4 Linear Probing with Fourth Moment Bounds 

As in Section 3, we study linear probing with $n$ stored keys in a table of size $m$, and a query $q$ not among the stored keys. We define the fill $\alpha = n/m$ and $\varepsilon = 1 - \alpha$. Pagh et al. [PPR09] presented a proof that with 5-independent hashing, the expected number of probes is $O(1/\varepsilon^{13/6})$. We will here improve this to $O(1/\varepsilon^2)$, which is optimal even for a fully random hash function. For the case of smaller fill, where $\alpha \le 1/2$, Thorup [Tho09] proved that the expected number of filled entries probed is $O(\alpha)$, which is optimal even for fully random functions.

As discussed in Section 3, our goal is to study the length $L$ of the longest filled interval containing a point $p$ which may depend on $h(q)$, e.g., $p = h(q)$. To bound the probability that an interval $I$ is full, we study more generally how the number $X_I$ of keys hashed to $I$ deviates from the mean $\alpha|I|$: if $I$ is full, the deviation is by more than $\varepsilon\alpha|I|$.

As we mentioned earlier, as an initial step Pagh et al. [PPR09] proved that if we consider the number of keys $X_I$ in an interval $I$ which may depend on the hash of a query key, then we have the following unweighted 4th moment bound:

$$\Pr[X_I \ge \Delta + \alpha|I|] = O\!\left(\frac{\alpha|I| + (\alpha|I|)^2}{\Delta^4}\right). \qquad (22)$$

This is an unweighted version of (21), so (22) holds both with 5-independent hashing and with simple tabulation hashing.

As with our simple tabulation hashing, for each $i$, we consider the event $C_{i,\delta,p}$ that for some point $p$ that may depend on the hash of the query key, there is some interval $I \ni p$, $2^i \le |I| < 2^{i+1}$, with relative deviation $\delta$. In perfect generalization of (22), we will show that (22) implies

$$\Pr[C_{i,\delta,p}] = O\!\left(\frac{\alpha 2^i + (\alpha 2^i)^2}{(\delta\alpha 2^i)^4}\right). \qquad (23)$$

First consider the simple case where $\delta \ge 1$. We apply Claim 9. Since $\delta \ge 1$, we get $j = i - 3$. The event $C_{i,\delta,p}$ implies that one of the $2^5 + 1$ relevant $j$-bins has $(1 + \frac{\delta}{2})\alpha 2^j$ keys. By (22), the probability of this event is bounded by

$$(2^5 + 1)\cdot O\!\left(\frac{\alpha 2^j + (\alpha 2^j)^2}{(\frac{\delta}{2}\alpha 2^j)^4}\right) = O\!\left(\frac{\alpha 2^i + (\alpha 2^i)^2}{(\delta\alpha 2^i)^4}\right).$$

This completes the proof of (23) when $\delta \ge 1$.

Now consider the case where $\delta \le 1$. If $\alpha 2^i \le 1$, (23) does not give a probability bound below 1, so we can assume $\alpha 2^i > 1$. Then (23) simplifies to

$$\Pr[C_{i,\delta,p}] = O\!\left(\frac{1}{\delta^4(\alpha 2^i)^2}\right). \qquad (24)$$

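This time we will apply Claim 10. Recall that a $j$-bin is dangerous for level $i$ if its absolute deviation is $\Delta_{j,i} = \frac{\delta\alpha 2^i}{24}\big/2^{(i-j)/5}$. We defined $j_0$ to be the smallest non-negative integer satisfying $\Delta_{j_0,i} \le \alpha 2^{j_0}$. If $C_{i,\delta,p}$ happens, then for some $j \in \{j_0, \ldots, i\}$, one of the $2^{i-j+2} + 1$ relevant $j$-bins is dangerous for level $i$. By (22), the probability of this event is bounded by $\sum_{j=j_0}^{i} O(P_j)$ where

$$P_j = 2^{i-j}\cdot\frac{\alpha 2^j + (\alpha 2^j)^2}{\Delta_{j,i}^4} = O\!\left(2^{i-j}\cdot\frac{\alpha 2^j + (\alpha 2^j)^2}{\big(\delta\alpha 2^i/2^{(i-j)/5}\big)^4}\right).$$

Let $j_1 = \lceil \log_2(1/\alpha) \rceil$. Note that $j_1 \le i$. For $j \ge j_1$, we have $\alpha 2^j + (\alpha 2^j)^2 = O((\alpha 2^j)^2)$, so

$$P_j = O\!\left(2^{i-j}\cdot\frac{(\alpha 2^j)^2}{\big(\delta\alpha 2^i/2^{(i-j)/5}\big)^4}\right) = O\!\left(\frac{1}{\delta^4(\alpha 2^i)^2\, 2^{(i-j)/5}}\right).$$

We see that for $j \ge j_1$, the bound decreases exponentially with $i - j$, so

$$\sum_{j=j_1}^{i} P_j = O\!\left(\frac{1}{\delta^4(\alpha 2^i)^2}\right). \qquad (25)$$

This is the desired bound from (24) for $\Pr[C_{i,\delta,p}]$, so we are done if $j_1 \le j_0$. However, suppose $j_0 < j_1$. By definition, we have $\alpha 2^{j_0} \le 1$ and $\Delta_{j_0,i} \le 1$, so

$$P_{j_0} = 2^{i-j_0}\cdot\frac{\alpha 2^{j_0} + (\alpha 2^{j_0})^2}{\Delta_{j_0,i}^4} = \Omega(1).$$

This means that there is nothing to prove, for with (25), we conclude that $1/(\delta^4(\alpha 2^i)^2) = \Omega(1)$. Therefore (24) does not promise any probability below 1. This completes the proof of (23). As in Theorem 8, we can consider the more general event $D_{\ell,\delta,p}$ that there exists an interval $I$ containing $p$ and of length at least $\ell$ such that the number of keys $X_I$ in $I$ deviates at least $\delta$ from the mean. Then, as a perfect generalization of (23), we get

$$\Pr[D_{\ell,\delta,p}] = \sum_{i \ge \log_2 \ell} O\!\left(\frac{\alpha 2^i + (\alpha 2^i)^2}{(\delta\alpha 2^i)^4}\right) = O\!\left(\frac{\alpha\ell + (\alpha\ell)^2}{(\delta\alpha\ell)^4}\right). \qquad (26)$$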



In the case of linear probing with fill $\alpha = 1 - \varepsilon$, we worry about filled intervals. Let $L$ be the length of the longest full interval containing the hash of a query key. For $\varepsilon \le 1/2$ and $\alpha \ge 1/2$, we use $\delta = \varepsilon$ and

$$\Pr[D_{\ell,\varepsilon,p}] = O\!\left(\frac{1}{\ell^2\varepsilon^4}\right),$$

so

$$E[L] \le \sum_{\ell=1}^{m} \Pr[D_{\ell,\varepsilon,p}] = \sum_{\ell=1}^{m} \min\left\{1,\ O\!\left(\frac{1}{\ell^2\varepsilon^4}\right)\right\} = O(1/\varepsilon^2),$$

improving the $O(1/\varepsilon^{13/6})$ bound from [PPR09]. However, contrasting the bounds with simple tabulation, the concentration from (26) does not work well for higher moments, e.g., the variance bound we get with $\varepsilon = 1/2$ is $O(\log n)$, and for larger moment $p \ge 3$, we only get a bound of $O(n^p/n^2) = O(n^{p-2})$.
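
The last step in the bound on $E[L]$ can be spelled out by splitting the sum at $T = \lceil 1/\varepsilon^2 \rceil$. In the sketch below, $K$ is a hypothetical name for the constant hidden in the $O(1/(\ell^2\varepsilon^4))$ tail bound; it does not appear in the text.

```latex
\begin{align*}
\sum_{\ell=1}^{m} \min\Big\{1,\ \frac{K}{\ell^2\varepsilon^4}\Big\}
  &\le \sum_{\ell=1}^{T} 1 \;+\; \frac{K}{\varepsilon^4}\sum_{\ell > T} \frac{1}{\ell^2} \\
  &\le T + \frac{K}{\varepsilon^4}\int_{T}^{\infty} \frac{dx}{x^2}
   \;=\; T + \frac{K}{\varepsilon^4}\cdot\frac{1}{T} \\
  &\le \Big(\frac{1}{\varepsilon^2} + 1\Big) + \frac{K}{\varepsilon^2}
   \;=\; O\big(1/\varepsilon^2\big),
\end{align*}
```

using $1/T \le \varepsilon^2$ in the last line. The first $T$ terms are bounded trivially by 1, and the remaining terms form a convergent tail of $\sum 1/\ell^2$.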

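Now consider $\alpha \le 1/2$. We use $\delta = 1/(2\alpha)$, noting that $(1+\delta)\alpha|I| = (\alpha + 1/2)|I| \le |I|$. Then

$$\Pr[D_{\ell,1/(2\alpha),p}] = O\!\left(\frac{\alpha\ell + (\alpha\ell)^2}{\ell^4}\right).$$

In particular, as promised in (15), we have

$$\Pr[L > 0] = \Pr[D_{1,1/(2\alpha),p}] = O(\alpha + \alpha^2) = O(\alpha).$$

More generally for the mean,

$$E[L] \le \sum_{\ell=1}^{m} \Pr[D_{\ell,1/(2\alpha),p}] = \sum_{\ell=1}^{m} O\!\left(\frac{\alpha\ell + (\alpha\ell)^2}{\ell^4}\right) = O(\alpha).$$

This reproves the bound from Thorup [Tho09].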



A Experimental Evaluation 

In this section, we make some simple experiments comparing simple tabulation with other hashing schemes, both on their own, and in applications. Most of our experiments are the same as those in [TZ09], except that we here include simple tabulation, whose relevance was not realized in [TZ09]. We will also consider Cuckoo hashing, which was not considered in [TZ09].

Recall the basic message of our paper that simple tabulation in applications shares many of the strong mathematical properties normally associated with an independence of at least 5. For example, when used in linear probing, the expected number of probes is constant for any set of input keys. With sufficiently random input, this expected constant is obtained by any universal hashing scheme [MV08], but other simple schemes fail on simple structured inputs like dense intervals or arithmetic progressions, which could easily occur in practice [TZ09].

Our experiments consider two issues:

• How fast is simple tabulation compared with other realistic hashing schemes on random input? In this case, the quality of the hash function doesn't matter, and we are only comparing their speed.

• What happens to the quality on structured input? We consider the case of dense intervals, and also the hypercube, which we believe should be the worst input for simple tabulation since it involves the least amount of randomness.

We will now briefly review the hashing schemes considered in our experiments. The focus will be on the common cases of 32 and 64 bit keys. If the initial keys are much bigger, we can typically first apply universal hashing to get down to a smaller domain, e.g., collision free down to a domain of size $n^2$. To achieve expected $O(1)$ time for linear probing, it suffices to map universally to a domain of just $O(n)$ [Tho09].



A.1 Multiplication-shift Hashing

The fastest known hashing schemes are based on a multiplication followed by a shift. 



Univ-mult-shift. If we are satisfied with plain universal hashing, then as shown in [DHKP97], we pick a random odd number $a$ from the same $\ell$-bit domain as the keys. If the desired output is $\ell_{out}$-bit keys, we compute the universal hash function:

$$h_a(x) = (a * x) \texttt{>>} (\ell - \ell_{out}).$$

This expression should be interpreted according to the C programming language. In particular, * denotes standard computer multiplication where the result is truncated to the same size as that of its largest operand. Here this means multiplication modulo $2^\ell$. Also, >> is a right shift taking out least significant bits. Mathematically, this is integer division by $2^{\ell - \ell_{out}}$. Note that this scheme is far from 2-independent, e.g., if two keys differ in only their most significant bit, then so do their hash values. However, the scheme is universal, which suffices, say, for expected constant times in chaining.



2-indep-mult-shift. For 2-independent hashing, we use the scheme from [Die96]. We pick a random $2\ell$-bit multiplier $a$ (which does not need to be odd), and a $2\ell$-bit number $b$. Now we compute:

$$h_{a,b}(x) = (a * x + b) \texttt{>>} (2\ell - \ell_{out}).$$

This works fine with a single 64-bit multiplication when $\ell = 32$. For $\ell = 64$, we would need to simulate 128-bit multiplication. In this case, we have a faster alternative used for string hashing [Tho09], viewing the key $x$ as consisting of two 32-bit keys $x_1$ and $x_2$. For a 2-independent 32-bit output, we pick three random 64-bit numbers $a_1$, $a_2$, and $b$, and compute

$$h_{a_1,a_2,b}(x_1 x_2) = ((a_1 + x_2) * (a_2 + x_1) + b) \texttt{>>} 32.$$

Concatenating two such values, we get a 64-bit 2-independent hash value using just two 64-bit multiplications.
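
The multiplication-shift schemes above can be sketched in Python by emulating C's truncating unsigned arithmetic with explicit masks. This is a minimal illustration, not the code used in the experiments: the function names are made up here, and the choice of which 32-bit half of the key plays the role of $x_1$ is an assumption.

```python
import random

MASK64 = (1 << 64) - 1  # emulates truncation of 64-bit unsigned arithmetic in C


def univ_mult_shift(x, a, l, l_out):
    """h_a(x) = (a*x) >> (l - l_out), product truncated to l bits; a should be odd."""
    return ((a * x) & ((1 << l) - 1)) >> (l - l_out)


def two_indep_mult_shift(x, a, b, l, l_out):
    """h_{a,b}(x) = (a*x + b) >> (2l - l_out), computed modulo 2^(2l)."""
    return ((a * x + b) & ((1 << (2 * l)) - 1)) >> (2 * l - l_out)


def pair_mult_shift(x, a1, a2, b):
    """32-bit hash of a 64-bit key viewed as two 32-bit halves.

    Assumption: x1 is the high half and x2 the low half of x.
    """
    x1, x2 = x >> 32, x & 0xFFFFFFFF
    prod = ((a1 + x2) & MASK64) * ((a2 + x1) & MASK64)
    return ((prod + b) & MASK64) >> 32


if __name__ == "__main__":
    rng = random.Random(1)
    a = rng.getrandbits(64) | 1  # random odd multiplier
    x = rng.getrandbits(64)
    print(hex(univ_mult_shift(x, a, 64, 32)))
```

One property worth checking with such a sketch: for an odd multiplier, flipping the most significant bit of the key flips exactly the most significant bit of the output of `univ_mult_shift`, which is one way to see that the scheme is far from 2-independent.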

A.2 Polynomial Hashing

For general $k$-independent hashing, we have the classic implementation of Carter and Wegman [WC81] by a degree $k-1$ polynomial over some prime field:

$$h(x) = \Big(\Big(\sum_{i=0}^{k-1} a_i x^i\Big) \bmod p\Big) \bmod 2^{\ell_{out}} \qquad (27)$$

for some prime $p \gg 2^{\ell_{out}}$ with each $a_i$ picked randomly from $[p]$. If $p$ is an arbitrary prime, this method is fairly slow because 'mod p' is slow. However, Carter and Wegman [CW79] pointed out that we can get a fast implementation using shifts and bitwise Boolean operations if $p$ is a so-called Mersenne prime of the form $2^i - 1$.

5-indep-Mersenne-prime. We use the above scheme for 5-independent hashing. For 32-bit keys, we use $p = 2^{61} - 1$, and for 64-bit keys, we use $p = 2^{89} - 1$.

For the practical implementation, recall that standard 64-bit multiplication on computers discards overflow beyond the 64 bits. For example, this implies that we may need four 64-bit multiplications just to implement a full multiplication of two numbers from $[2^{61} - 1]$. This is why specialized 2-independent schemes are much faster. Unfortunately, we do not know a practical generalization for higher independence.
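
The shift-and-mask reduction modulo a Mersenne prime can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: Python integers are unbounded, so the sketch shows only the reduction trick and Horner evaluation of (27), not the multi-word multiplication a C implementation would need. The function names are hypothetical.

```python
import random


def mod_mersenne(x, s=61):
    """Reduce 0 <= x < 2^(2s) modulo p = 2^s - 1 using shifts and masks only."""
    p = (1 << s) - 1
    x = (x & p) + (x >> s)  # uses 2^s = 1 (mod p); result is below 2^(s+1)
    x = (x & p) + (x >> s)  # second fold brings the value into [0, p]
    return 0 if x == p else x


def poly_hash(x, coeffs, l_out, s=61):
    """Evaluate (27) by Horner's rule; coeffs[i] is a_i, with x and all a_i below p."""
    acc = 0
    for a in reversed(coeffs):
        acc = mod_mersenne(acc * x + a, s)
    return acc & ((1 << l_out) - 1)  # final 'mod 2^l_out'


if __name__ == "__main__":
    rng = random.Random(3)
    p = (1 << 61) - 1
    coeffs = [rng.randrange(p) for _ in range(5)]  # degree 4: 5-independent
    print(poly_hash(123456789, coeffs, 32))
```

With five random coefficients this is the degree-4 polynomial used for 5-independence. Each Horner step keeps the intermediate value below $p^2 < 2^{2s}$, which is the range the two-fold reduction handles.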



A.3 Tabulation-Based Hashing

The basic idea in tabulation based schemes is to replace multiplications with lookups in tables that 
are small enough to fit in fast memory. 

simple-table. Simple tabulation is the basic example of tabulation based hashing. A key $x = x_1 \cdots x_c$ is divided into $c$ characters. For $i = 1 \ldots c$, we have a table $T_i$ providing a random value $T_i[x_i]$ for each character, and then we just return the xor of all the $T_i[x_i]$. Since the tables are small, it is easy to populate them with random data (e.g. based on atmospheric noise http://random.org). Simple tabulation is only 3-independent.
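
A minimal sketch of simple tabulation with 8-bit characters follows. The table layout, the function names, and the use of Python's `random` module to fill the tables are illustrative assumptions, not the setup of the experiments.

```python
import random


def make_tables(c, out_bits, rng):
    """One table of 256 random out_bits-bit values per 8-bit character position."""
    return [[rng.getrandbits(out_bits) for _ in range(256)] for _ in range(c)]


def simple_tabulation(x, tables):
    """Xor together T_i[x_i], where x_i is the i-th 8-bit character of x (low byte first)."""
    h = 0
    for table in tables:
        h ^= table[x & 0xFF]
        x >>= 8
    return h


if __name__ == "__main__":
    rng = random.Random(2024)
    tables = make_tables(c=4, out_bits=32, rng=rng)  # 32-bit keys, 32-bit hashes
    print(hex(simple_tabulation(0xDEADBEEF, tables)))
```

The sketch also makes the lack of 4-independence easy to observe: for four keys whose characters form a "rectangle" (two values on one position, two on another, all other positions equal), the xor of the four hash codes is always zero, whatever the tables contain.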

We are free to choose the size of the character domain, e.g., we could use 16-bit characters instead of 8-bit characters, but then the tables would not fit in the fast L1 cache. The experiments from [TZ09] indicate that 8-bit characters give much better performance, and that is what we use here.

5-indep-TZ-table. To get higher independence, we can compute some additional "derived characters" and use them to index into new tables, like the regular characters. Thorup and Zhang [TZ04, TZ09] presented a fast such scheme for 5-independent hashing. With $c = 2$ characters, they simply use the derived character $x_1 + x_2$. For $c > 2$, this generalizes with $c - 1$ derived characters and a total of $2c - 1$ lookups for 5-independent hashing. The scheme is rather complicated to implement, but runs well.



A.4 Hashing in Isolation

Our first goal is to time the different hashing schemes when run in isolation. We want to know 
how simple tabulation compares in speed to the fast multiplication-shift schemes and to the 5- 
independent schemes whose qualities it shares. We compile and run the same C code on two 
different computers: 

32-bit computer: Single-core Intel Xeon 3.2 GHz 32-bit processor with 2048KB cache, 32-bit 
addresses and libraries. 

64-bit computer: Dual-core Intel Xeon 2.6 GHz 64-bit processor with 4096KB cache, 64-bit 
addresses and libraries. 

Table 1 presents the timings for the different hashing schemes, first mapping 32-bit keys to 32-bit values, second mapping 64-bit keys to 64-bit values. Not surprisingly, we see that the 64-bit computer benefits more than the 32-bit computer when 64-bit multiplication is critical; namely in univ-mult-shift for 64 bits, 2-indep-mult-shift, and 5-indep-Mersenne-prime.

Hashing random keys: hashing time (ns)

bits   hashing scheme            32-bit computer   64-bit computer
32     univ-mult-shift                      1.87              2.33
32     2-indep-mult-shift                   5.78              2.88
32     5-indep-Mersenne-prime              99.70             45.06
32     5-indep-TZ-table                    10.12             12.66
32     simple-table                         4.98              4.61
64     univ-mult-shift                      7.05              3.14
64     2-indep-mult-shift                  22.91              5.90
64     5-indep-Mersenne-prime             241.99             68.67
64     5-indep-TZ-table                    75.81             59.84
64     simple-table                        15.54             11.40

Table 1: Average time per hash computation for 10 million hash computations.

As mentioned, the essential difference between our experiments and those in [TZ09] is that simple tabulation is included, and our interest here is how it performs relative to the other schemes. In the case of 32-bit keys, we see that on both computers, the performance of simple tabulation is similar to 2-indep-mult-shift. Also, not surprisingly, we see that it is more than twice as fast as the much more complicated 5-indep-TZ-table.

When we go to 64-bits, it may be a bit surprising that simple tabulation becomes more than 
twice as slow, for we do exactly twice as many look-ups. However, the space is quadrupled with 
twice as many tables, each with twice as big entries, moving up from 1KB to 8KB, so the number 
of cache misses may increase. 

Comparing simple tabulation with the 2-indep-mult-shift, we see that it is faster on the 32-bit 
computer and less than twice as slow on the 64-bit computer. We thus view it as competitive in 
speed. 

The competitiveness of simple tabulation compared with multiplication-shift based methods
agrees with the experiments of Thorup [Tho00] from more than 10 years ago on older computer
architectures. The experiments from [Tho00] did not include schemes of higher independence.

The competitiveness of our cache based simple tabulation with multiplication-shift based methods
is to be expected both now and in the future. One can always imagine that multiplication
becomes faster than cache, and vice versa. However, most data processing involves frequent
cache and memory access. Therefore, even if it were technically possible, it would normally be
wasteful to configure a computer with much faster multiplication than cache. Conversely, however,
there is a lot of data processing that does not use multiplication, so it is easier to imagine real
computers configured with faster cache than multiplication.

Concerning hardware, we note that simple tabulation is ideally suited for parallel lookups of 
the characters of a key. Also, the random data in the character tables are only changed rarely in 
connection with a rehash. Otherwise we only read the tables, which means that we could potentially 
have them stored in simpler and faster EEPROM or flash memory. This would also avoid conflicts 
with other applications in cache.

Linear probing with random keys: update time (nanoseconds)

hashing scheme          32-bit computer  64-bit computer
univ-mult-shift                     141              149
2-indep-mult-shift                  151              157
5-indep-Mersenne-prime              289              245
5-indep-TZ-table                    177              211
simple-table                        149              166

Table 2: Linear probing with random 32-bit keys. The time is averaged over 10 million updates to
a set with 1 million keys in a linear probing table with 2 million entries.

A.5 Linear Probing

We now consider what happens when we use the different hashing schemes with linear probing.
In this case, the hash function needs good random properties for good performance on
worst-case input. We consider $2^{20}$ 32-bit keys in a table with $2^{21}$ entries. The table therefore
uses 8MB space, which does not fit in the cache of either computer, so there will be competition for
the cache. Each experiment averaged the update time over 10 million insert/delete cycles. For each
input type, we ran 100 such experiments on the same input but with different random seeds for
the hash functions.
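
A scaled-down version of this experiment can be sketched as below. The sketch is an assumption-laden illustration rather than the code behind the tables: it counts probes instead of measuring time, uses $2^{11}$ keys in $2^{12}$ slots, deletes by backward shifting (the text does not say how deletions were handled), and all names (`make_simple_tab`, `LinearProbingTable`, `avg_probes`) are hypothetical.

```python
import random

def make_simple_tab(rng, c=4, out_bits=32):
    """Return a simple tabulation hash over c 8-bit characters (fresh random tables)."""
    tables = [[rng.getrandbits(out_bits) for _ in range(256)] for _ in range(c)]
    def h(x):
        v = 0
        for t in tables:
            v ^= t[x & 0xFF]
            x >>= 8
        return v
    return h

class LinearProbingTable:
    """Open-addressing table with linear probing; the load must stay below 1."""

    def __init__(self, log_size, hash_fn):
        self.mask = (1 << log_size) - 1
        self.slots = [None] * (1 << log_size)
        self.hash_fn = hash_fn
        self.probes = 0                       # total number of slots inspected

    def _find(self, key):
        """Return the slot holding key, or the first empty slot of its run."""
        i = self.hash_fn(key) & self.mask
        while True:
            self.probes += 1
            if self.slots[i] is None or self.slots[i] == key:
                return i
            i = (i + 1) & self.mask

    def contains(self, key):
        return self.slots[self._find(key)] == key

    def insert(self, key):
        self.slots[self._find(key)] = key

    def delete(self, key):
        i = self._find(key)
        if self.slots[i] is None:
            return                            # key was not present
        self.slots[i] = None
        j = i
        while True:                           # backward-shift the rest of the run
            j = (j + 1) & self.mask
            self.probes += 1
            if self.slots[j] is None:
                return
            home = self.hash_fn(self.slots[j]) & self.mask
            # Move the entry into the hole unless its home slot lies in (i, j].
            if ((j - home) & self.mask) >= ((j - i) & self.mask):
                self.slots[i], self.slots[j] = self.slots[j], None
                i = j

LOG_SIZE, CYCLES = 12, 20000
n = 1 << (LOG_SIZE - 1)                       # load factor 1/2, as in the text
rng = random.Random(42)
table = LinearProbingTable(LOG_SIZE, make_simple_tab(rng))
pool = rng.sample(range(1 << 32), n + CYCLES) # distinct random 32-bit keys
live, fresh = pool[:n], pool[n:]
for k in live:
    table.insert(k)
table.probes = 0
for new_key in fresh:                         # one cycle: delete a key, insert a new one
    idx = rng.randrange(n)
    table.delete(live[idx])
    live[idx] = new_key
    table.insert(new_key)
avg_probes = table.probes / (2 * CYCLES)      # slots inspected per update
```

Because the probe accounting here is a choice of this sketch, `avg_probes` is not directly comparable to the probe counts reported in the text; it is only meant to show how such a count can be collected per hashing scheme by swapping the `hash_fn` argument.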

Random input. First we consider the case of a random input, where the randomization properties 
of the hash function are irrelevant. This means that the focus is on speed just like when we ran 
the experiments in isolation. Essentially, the cost per update should be that of computing the hash 
value, as in Table 1, plus a common additive cost: a random access to look up the hash entry plus
a number of sequential probes. The average number of probes per update was tightly concentrated 
around 3.28 for all schemes, deviating by less than 0.02 over the 100 experiments. 

An interesting new issue is that the different schemes now have to compete with the linear 
probing table for the cache. In particular, this could hurt the tabulation based schemes. Another 
issue is that when schemes are integrated with an application, the optimizing compiler may have 
many more opportunities for pipelining etc. The results for random input are presented in Table
2. Within the 100 experiments, the deviation for each data point was less than 1%, and here we
just present the mean.

Compared with Table 1, we see that our 32-bit computer performs better than the 64-bit
computer on linear probing. In Table 1 we had that the 64-bit processor was twice as fast at
the hash computations based on 64-bit multiplication, but in Table 2, when combined with linear
probing, we see that it is only faster in the most extreme case of 5-indep-Mersenne-prime. One
of the more surprising outcomes is that 5-indep-Mersenne-prime is so slow compared with the
tabulation based schemes on the 64-bit computer. We had expected the tabulation based schemes
to take a hit from cache competition, but the effect appears to be minor.

The basic outcome is that simple tabulation in linear probing with random input is competitive 
with the fast multiplication-shift based scheme and about 20% faster than the fastest 5-independent 
scheme (which is much more complicated to implement). We note that we cannot hope for a big 
multiplicative gain in this case, since the cost is dominated by the common additive cost from 
working the linear probing table itself.

Structured input. We now consider the case where the input keys are structured in the sense of
being drawn in random order from a dense interval: a commonly occurring case in practice which
is known to cause unreliable performance for most simple hashing schemes [PPR09, PT10, TZ09].
The results are shown in Figure 2. For each hashing scheme, we present the average number of
probes for each of the 100 experiments as a cumulative distribution function. We see that simple
tabulation and the 5-independent schemes remain tightly concentrated while the
multiplication-shift schemes have significant variance, as observed also in [TZ09]. This behavior is
repeated in the timings on the two computers, but shifted due to the difference in speed for the
hash computations.

Figure 2: Keys from dense interval. The multiplication-shift schemes sometimes use far more
probes, which also shows in much longer running times.
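
The dense-interval experiment can be mimicked at small scale with the sketch below. It rests on assumptions: univ-mult-shift is taken to be the classic multiply-shift scheme with a random odd 32-bit multiplier, only insertions are counted, and the table has $2^{12}$ slots rather than $2^{21}$. The helper names are hypothetical.

```python
import random

def mult_shift(a, x, out_bits):
    """Multiply-shift on 32-bit keys: the top out_bits bits of (a * x) mod 2^32, a odd."""
    return ((a * x) & 0xFFFFFFFF) >> (32 - out_bits)

def make_simple_tab(rng, out_bits):
    """Simple tabulation on 32-bit keys with four 8-bit characters."""
    t = [[rng.getrandbits(out_bits) for _ in range(256)] for _ in range(4)]
    def h(x):
        return (t[0][x & 0xFF] ^ t[1][(x >> 8) & 0xFF]
                ^ t[2][(x >> 16) & 0xFF] ^ t[3][(x >> 24) & 0xFF])
    return h

def avg_insert_probes(keys, h, log_size):
    """Insert distinct keys with linear probing; return mean slots inspected per insert."""
    mask = (1 << log_size) - 1
    slots = [None] * (1 << log_size)
    probes = 0
    for k in keys:
        i = h(k) & mask
        probes += 1
        while slots[i] is not None:
            i = (i + 1) & mask
            probes += 1
        slots[i] = k
    return probes / len(keys)

LOG_SIZE = 12
rng = random.Random(2024)
start = rng.getrandbits(31)
keys = list(range(start, start + (1 << (LOG_SIZE - 1))))  # dense interval, load 1/2
rng.shuffle(keys)                                         # drawn in random order

results = {"mult-shift": [], "simple-table": []}
for _ in range(20):                                       # 20 independent seeds per scheme
    a = rng.getrandbits(32) | 1                           # random odd multiplier
    ms = lambda x, a=a: mult_shift(a, x, LOG_SIZE)
    results["mult-shift"].append(avg_insert_probes(keys, ms, LOG_SIZE))
    st = make_simple_tab(rng, LOG_SIZE)
    results["simple-table"].append(avg_insert_probes(keys, st, LOG_SIZE))

for name, vals in results.items():
    print(name, "min %.2f  max %.2f" % (min(vals), max(vals)))
```

Comparing the spread between the minimum and maximum over seeds for each scheme is a crude stand-in for the cumulative distribution functions plotted in Figure 2.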

Thus, among simple fast hashing schemes, simple tabulation stands out in not failing on a dense
interval. Of course, it might be that simple tabulation has a different worst-case input. A plausible
guess is that the worst instance of simple tabulation is the hypercube, which minimizes the number
of random table entries used. In our case, for $2^{20}$ keys, we experimented with the set $[32]^4$,
i.e., we only use 32 values for each of the 4 characters. The results for the number of probes are
presented in Figure 3.
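
To make the hypercube input concrete, the following sketch enumerates such a key set. It assumes 8-bit characters packed into an integer key; the function name `hypercube_keys` and its parameters are hypothetical.

```python
from itertools import product

def hypercube_keys(side, c, char_bits=8):
    """All keys whose c characters (char_bits bits each) each take a value in range(side)."""
    keys = []
    for chars in product(range(side), repeat=c):
        x = 0
        for ch in chars:
            x = (x << char_bits) | ch
        keys.append(x)
    return keys

small = hypercube_keys(4, 4)   # 4^4 = 256 keys, quick to enumerate
# hypercube_keys(32, 4) would give the 32^4 = 2^20 keys of the set [32]^4 used in the text.
```

With the set $[32]^4$, the $2^{20}$ keys touch only 32 entries in each of the 4 character tables, that is, 128 random table entries in total, which is the sense in which the hypercube minimizes the randomness used.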

Figure 3: Keys from hypercube.

Thus, simple tabulation remains extremely robust and tightly concentrated, but once again the
multiplication-shift schemes fail (this time more often but less badly). The theoretical explanation
from [PT10] is that multiplication-shift fails on arithmetic sequences, and in the hypercube we have
many different but shorter arithmetic sequences. It should be said that although it cannot be seen
in the plots, the structured inputs did lead to more deviation in probes for simple tabulation: the
deviation from 3.28 over 100 independent runs grew from below 0.5% with random input to almost
1% with any of the structured inputs.

Obviously, no experiment can confirm that simple tabulation is robust for all possible inputs. 
Our theoretical analysis implies strong concentration, e.g., in the sense of constant variance, yet the 
hidden constants are large. Our experiments suggest that the true constants are very reasonable. 

Cuckoo hashing. Our results show that the failure probability in constructing a cuckoo hashing
table is $O(n^{-1/3})$. A pertinent question is whether the constants hidden by the O-notation are
too high from a practical point of view. Experiments cannot conclusively prove that this constant is
always small, since we do not know the worst instance. However, as for linear probing, a plausible
guess is that the instance eliciting the worst behavior is a hypercube: $S = A^c$, for $A \subset \Sigma$.
We made $10^5$ independent runs with the following input instances:

32-bit keys: Tabulation uses $c = 4$ characters. We set $A = [32]$, giving $32^4 = 2^{20}$ keys in S. The
empirical success probability was 99.4%.
64-bit keys: Tabulation uses $c = 8$ characters. We set $A = [8]$, giving $8^8 = 2^{24}$ keys in S. The
empirical success probability was 97.1%.

These experiments justify the belief that our scheme is effective in practice.
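
A success-rate experiment of this kind could be organized roughly as in the sketch below. Several details are assumptions, since the text does not state them: two tables with $2n$ slots each, a construction that is declared failed after a fixed number of evictions, and a much smaller hypercube ($A = [4]$, $c = 4$) so that the sketch runs quickly. The names `build_cuckoo`, `lookup` and `max_kicks` are hypothetical.

```python
import random

def make_simple_tab(rng, c, out_bits):
    """Simple tabulation hash over c 8-bit characters with fresh random tables."""
    tables = [[rng.getrandbits(out_bits) for _ in range(256)] for _ in range(c)]
    def h(x):
        v = 0
        for t in tables:
            v ^= t[x & 0xFF]
            x >>= 8
        return v
    return h

def build_cuckoo(keys, log_size, rng, c=4, max_kicks=200):
    """Try to place distinct keys into two tables of 2^log_size slots each.

    Returns (tables, hashes) on success, or None if some insertion gives up
    after max_kicks evictions (treated here as a construction failure).
    """
    hashes = [make_simple_tab(rng, c, log_size) for _ in range(2)]
    tables = [[None] * (1 << log_size) for _ in range(2)]
    for key in keys:
        cur, side = key, 0
        for _ in range(max_kicks):
            i = hashes[side](cur)
            cur, tables[side][i] = tables[side][i], cur  # place cur, pick up the evicted key
            if cur is None:
                break
            side ^= 1                                     # evicted key tries its other table
        else:
            return None
    return tables, hashes

def lookup(built, key):
    """A key can only live in one of its two candidate slots."""
    tables, hashes = built
    return tables[0][hashes[0](key)] == key or tables[1][hashes[1](key)] == key

# Small hypercube S = A^c with A = [4] and c = 4 (the text uses A = [32], i.e. 2^20 keys).
keys = [a | (b << 8) | (c_ << 16) | (d << 24)
        for a in range(4) for b in range(4) for c_ in range(4) for d in range(4)]
rng = random.Random(5)
RUNS = 200
successes = sum(build_cuckoo(keys, 9, rng) is not None for _ in range(RUNS))
success_rate = successes / RUNS
```

Each run draws fresh character tables for both hash functions, so `success_rate` estimates the construction success probability over the randomness of the hash functions for one fixed input, in the same spirit as the percentages quoted above.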

It has already been shown conclusively that weaker multiplication schemes do not perform
well. Dietzfelbinger and Schellbach [DS09] show analytically that, when S is chosen uniformly at
random from the universe $[n^{12/11}]$ or smaller, cuckoo hashing with 2-independent multiplicative
hashing fails with probability $1 - o(1)$. This is borne out in the experiments of [DS09], which give
failure probability close to 1 for random sets that are dense in the universe. On the other hand,
the more complicated tabulation hashing of Thorup and Zhang [TZ04] will perform at least as well
as simple tabulation (that algorithm is a superset of simple tabulation).

A notable competitor to simple tabulation is a tailor-made tabulation hash function analyzed by
Dietzfelbinger and Woelfel [DW03]. This function uses two arrays of size r and four d-independent
hash functions to obtain failure probability $n/r^{d/2}$.

Let us analyze the parameters needed in a practical implementation. If we want the same space
as in simple tabulation, we can set $r = 2^{10}$ (this is larger than $\Sigma = 256$, because fewer tables
are needed). For a nontrivial failure probability with sets of $2^{20}$ keys, this would require
6-independence. In principle, the tabulation-based scheme of [TZ04] can support 6-independence
with $5c - 4$ tables (and lookups). This scheme has not been implemented yet, but based on
combinatorial complexity is expected to be at least twice as slow as the 5-independent scheme tested
in Table 1 (i.e. 4-8 times slower than simple tabulation). Alternatively, we can compute four
6-independent hash functions using two polynomials of degree 5 on 64-bit values (e.g. modulo
$2^{61} - 1$). Based on Table 1, this would be two orders of magnitude slower than simple tabulation.
With any of the alternatives, the tailor-made scheme is much more complicated to implement.
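
For reference, the polynomial alternative can be sketched as follows. This is a generic illustration of evaluating a random polynomial modulo the Mersenne prime $2^{61} - 1$ (a polynomial with k random coefficients gives a k-independent family over keys smaller than the prime); it is not the tuned 5-indep-Mersenne-prime code timed in Table 1, and the names `mod_p61` and `poly_hash` are hypothetical.

```python
import random

P61 = (1 << 61) - 1   # the Mersenne prime 2^61 - 1

def mod_p61(v):
    """Reduce 0 <= v < 2^122 modulo 2^61 - 1 with shifts and adds instead of division."""
    v = (v & P61) + (v >> 61)     # uses 2^61 = 1 (mod P61)
    v = (v & P61) + (v >> 61)     # second fold brings v into [0, P61]
    return 0 if v == P61 else v

def poly_hash(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) mod 2^61 - 1 by Horner's rule (x < 2^61)."""
    acc = 0
    for a in reversed(coeffs):
        acc = mod_p61(acc * x + a)
    return acc

rng = random.Random(3)
coeffs6 = [rng.randrange(P61) for _ in range(6)]   # degree 5: a 6-independent family
value = poly_hash(coeffs6, 0xDEADBEEF)
```

Each evaluation costs one wide multiplication and one reduction per coefficient, which gives some intuition for why such schemes are far slower than a handful of table lookups and XORs.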



B Chernoff Bounds with Fixed Means 

We will here formally establish that the standard Chernoff bounds hold when each variable has
a fixed mean, even if the variables are not independent. Below we shall use the notation that if
we have variables $x_1, x_2, \ldots$, then $x_{<i} = \{x_j\}_{j<i}$. In particular, $\sum x_{<i} = \sum_{j<i} x_j$.

Proposition 32. Consider n possibly dependent random variables $X_1, X_2, \ldots, X_n \in [0,1]$. Suppose
for each i that $E[X_i] = \mu_i$ is fixed no matter the values of $X_1, \ldots, X_{i-1}$, that is, for any values
$x_1, \ldots, x_{i-1}$, $E[X_i \mid X_{<i} = x_{<i}] = \mu_i$. Let $X = \sum_i X_i$ and $\mu = E[X] = \sum_i \mu_i$. Then for any
$\delta > 0$, the bounds are:

$$\Pr[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}, \qquad \Pr[X \le (1-\delta)\mu] \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}.$$

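Proof. The proof is a simple generalization of the standard proof when the $X_i$ are independent.
We wish to bound the probability of $X \ge (1+\delta)\mu$. To do this we will prove that

$$E[(1+\delta)^{X}] \le e^{\delta\mu}.$$

The proof will be by induction on n. We have

$$E[(1+\delta)^{X}] = \sum_{x_{<n}} \left( \Pr[X_{<n} = x_{<n}] \times E[(1+\delta)^{X} \mid X_{<n} = x_{<n}] \right)$$

$$= \sum_{x_{<n}} \left( \Pr[X_{<n} = x_{<n}] \times (1+\delta)^{\sum x_{<n}} \times E[(1+\delta)^{X_n} \mid X_{<n} = x_{<n}] \right).$$

Now, for any random variable $Y \in [0,1]$, by convexity,

$$E[(1+\delta)^{Y}] \le E[Y](1+\delta) + 1 - E[Y] = 1 + \delta E[Y] \le e^{\delta E[Y]}.$$

Therefore, since $E[X_n \mid X_{<n} = x_{<n}] = \mu_n$ for any value $x_{<n}$ of $X_{<n}$,

$$E[(1+\delta)^{X_n} \mid X_{<n} = x_{<n}] \le e^{\delta\mu_n}.$$

Thus

$$E[(1+\delta)^{X}] = \sum_{x_{<n}} \left( \Pr[X_{<n} = x_{<n}] \times (1+\delta)^{\sum x_{<n}} \times E[(1+\delta)^{X_n} \mid X_{<n} = x_{<n}] \right)$$

$$\le \sum_{x_{<n}} \left( \Pr[X_{<n} = x_{<n}] \times (1+\delta)^{\sum x_{<n}} \times e^{\delta\mu_n} \right)$$

$$= E[(1+\delta)^{\sum X_{<n}}] \times e^{\delta\mu_n} \le e^{\delta \sum \mu_{<n}} \times e^{\delta\mu_n} = e^{\delta\mu}.$$

The last inequality followed by induction. Finally, by Markov's inequality, we conclude that

$$\Pr[X \ge (1+\delta)\mu] \le \frac{E[(1+\delta)^{X}]}{(1+\delta)^{(1+\delta)\mu}} \le \frac{e^{\delta\mu}}{(1+\delta)^{(1+\delta)\mu}} = \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.$$

The case $X \le (1-\delta)\mu$ follows by a symmetric argument. □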
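
For completeness, one way to spell out the symmetric argument is sketched below. It is not part of the original proof and assumes $0 < \delta < 1$, so that $1 - \delta > 0$ and $t \mapsto (1-\delta)^t$ is decreasing and convex.

```latex
% Lower tail, assuming 0 < \delta < 1.
\begin{align*}
E\big[(1-\delta)^{Y}\big]
  &\le E[Y](1-\delta) + 1 - E[Y]
   = 1 - \delta E[Y] \le e^{-\delta E[Y]}
   && \text{for } Y \in [0,1], \text{ by convexity},\\
E\big[(1-\delta)^{X}\big] &\le e^{-\delta\mu}
   && \text{by the same induction on } n,\\
\Pr[X \le (1-\delta)\mu]
  &= \Pr\big[(1-\delta)^{X} \ge (1-\delta)^{(1-\delta)\mu}\big]
   \le \frac{E[(1-\delta)^{X}]}{(1-\delta)^{(1-\delta)\mu}}
   \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}.
\end{align*}
```

The second-to-last inequality is Markov's inequality applied to the nonnegative variable $(1-\delta)^{X}$.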
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